Instance segmentation is a notoriously hard problem in the diverse and challenging conditions of real-world environments. Yet it is a fundamental ability for autonomous robots, as it enables safe and robust decision-making under the large uncertainty of the real world. Autonomous robots need to detect, outline and track animate and inanimate objects in real time to decide how to act next. Unlike category-level segmentation, instance segmentation aims to provide detailed information about the location, geometry and number of individual objects. It is particularly challenging in scenarios where objects heavily overlap each other. Furthermore, an image can contain an arbitrary number of object instances. The labeling of these instances has to be permutation-invariant, i.e. it does not matter which specific label an instance receives, as long as the label is different from all other object instance labels.
Current state-of-the-art approaches mainly operate on single RGB images. They have had enormous success by leveraging large-scale, annotated datasets and large-capacity, deep neural networks. The most successful approaches, such as Mask R-CNN, rely on a region proposal process. A first module generates object proposals in the form of 2D bounding boxes. These bounding boxes serve as input to a second module that performs object recognition and segmentation within the boxes. Such networks are challenged in cluttered scenarios with heavily overlapping objects, where a region proposal may already contain multiple instances (for an example see Fig. 1).
Inspired by the benefits of including depth data in robot perception tasks, we propose a novel instance segmentation method that avoids region proposals and therefore the aforementioned challenges. In our approach, the receptive field of each prediction is not constrained to the detection box of the corresponding instance. In this way, it can utilize more global information than region-proposal-based methods, which are restricted to the inside of the proposed region. Central to our approach are (i) the use of depth data and (ii) an explicit embedding in a feature space related to 3D object properties. This model is not merely a variation of Mask R-CNN but rethinks how depth data can be best exploited.
Specifically, we use the first and second order moments of the object occupancy function to represent an object instance. This representation enables us to transform the problem of instance-level segmentation into a general regression problem. We train an hourglass Deep Neural Network (DNN) where each pixel in the output votes for the relative position of the corresponding object center (first order moment) and for the object’s size and pose (second order moment). The final instance segmentation is achieved through clustering in the space of moments. The carefully designed, object-centric training loss is defined on the output of the clustering. We show that the combination of the proposal-free method with depth data leads to improved accuracy and robustness of instance segmentation under challenging real-world conditions. Furthermore, the proposed explicit embedding could be more readily used in a robot manipulation scenario.
Our primary contributions are: (1) a novel deep learning framework for 3D instance segmentation based on RGB-D images; (2) an extensive quantitative evaluation that shows state-of-the-art performance on a difficult, synthetic dataset targeted at robotic manipulation and validates our design choices, together with qualitative results on real data for understanding the strengths and limitations of our method.
II Related Work
Instance segmentation methods can be divided into two main approaches: methods based on region proposals and proposal-free methods. Approaches based on region proposals first generate these proposals (e.g. in the form of bounding boxes). Then in these regions, objects are detected and segmented. Proposal-free approaches drop the proposal process. They either sequentially segment object instances or use different object representations. In the following, we review approaches in each of the two categories.
II-A Region Proposal Based Approaches
Hariharan et al. propose one of the first works that addresses instance-level segmentation using the region proposal pipeline. It employs a convolutional neural network to generate bounding boxes, which are then classified into object categories. The tightness of the boxes is improved by bounding box regression. A class-specific segmentation is then performed in this bounding box to simultaneously detect and segment the object. He et al. use Fast-RCNN boxes and build a multi-stage pipeline to extract features, classify and segment the object. Gupta et al. perform instance segmentation based on region proposals in RGB-D images. The authors use decision forests to predict a foreground mask in a detection bounding box.
Approaches based on region proposals achieve state-of-the-art results on large-scale datasets such as COCO and Cityscapes. Such approaches are, however, challenged when facing small objects or objects that strongly occlude each other. In these cases, a bounding box may contain multiple objects, and it becomes difficult to detect and segment the correct one (as visualized in Fig. 1). Occlusions are very common, especially in robotic manipulation scenarios such as bin-picking or cleaning a cluttered counter top.
II-B Region Proposal Free Approaches
Proposal-free approaches differ from each other mainly in the employed object representation. De Brabandere et al. train a neural network learning pixel features of instances in an unsupervised fashion based on a discriminative loss. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance are close to each other while different instances are separated by a wide margin. Newell et al. propose an approach that teaches a network to simultaneously output detections and group assignments through implicit embedding. Different from these implicit embedding methods, our work leverages an explicit embedding built upon object geometry, which we qualitatively show to have a small transfer gap to real data. Furthermore, the attributes (object center and object covariance matrix encoding pose and shape) are useful features for robotic manipulation.
Liang et al. predict pixel-wise feature vectors representing the ground truth bounding box of the instance each pixel belongs to. Leveraging another sub-network that predicts an object count, they cluster the output of the network into individual instances using spectral clustering. However, predicting the exact number of instances is difficult, and a wrongly predicted number harms the segmentation result. We propose to use object features, such as the 3D centroid and pose, that are predicted per pixel. These form the input to the clustering method, which infers the final number of instances.
III Technical Approach
We propose a 3D region-proposal-free approach towards instance segmentation. Specifically, we propose a new object representation that is predicted per pixel. This is followed by a clustering step in this new feature space. The training loss is defined on the output of the clustering.
III-A Object Representation
To avoid the permutation-invariance problem in instance level segmentation, we use a new representation to indicate individual object instances. Let be a point cloud containing 3D points recorded from the camera frame while looking at . Let denote the bounding box center of Object with respect to the camera frame. Let denote the second order moments of where and As object feature, we define
Note that encodes object location relative to camera frame while encodes object size and pose. If a pixel in the input image shows object , then the correct output feature is . We assume that different object instances will have different features with respect to the camera frame . Therefore, pixels that show the same object will have the same object feature value while pixels belonging to different objects will differ in feature values. Three example scenes are visualized in Fig. 2. We propose a neural network model to make per-pixel predictions of object features.
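As a rough illustration of the representation described above, the following Python sketch computes a center and second order moments for a small point cloud and concatenates them into one feature vector. The exact definitions are not recoverable from the text here, so several choices are assumptions: the center is taken as the midpoint of the axis-aligned bounding box, the second order moments as the covariance of the points about their mean, and the feature as the center followed by the six unique covariance entries. The function name `object_feature` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def object_feature(points: Sequence[Point]) -> List[float]:
    """Return [cx, cy, cz, s_xx, s_xy, s_xz, s_yy, s_yz, s_zz] for one object."""
    if not points:
        raise ValueError("point cloud must not be empty")
    n = len(points)
    # Assumed first order term: midpoint of the axis-aligned bounding box.
    center = [
        (min(p[a] for p in points) + max(p[a] for p in points)) / 2.0
        for a in range(3)
    ]
    # Assumed second order term: covariance of the points about their mean.
    mean = [sum(p[a] for p in points) / n for a in range(3)]
    cov = [
        [
            sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(3)
        ]
        for a in range(3)
    ]
    # The covariance matrix is symmetric, so only the upper triangle is kept.
    upper = [cov[a][b] for a in range(3) for b in range(a, 3)]
    return center + upper


# Eight corners of a 2 x 2 x 2 cube in front of the camera.
cube = [(x, y, z) for x in (0.0, 2.0) for y in (0.0, 2.0) for z in (4.0, 6.0)]
feature = object_feature(cube)
```

Every pixel that shows this object would share the same target vector, which is what makes the labeling independent of any instance ordering.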
III-B Instance Segmentation Process
The features predicted by our model are an approximation of the ground truth value. Each pixel (u,v) that corresponds to the same object will have values that differ from ground truth by some . The predicted object feature values form a cluster around the ground truth value under the error distribution . We propose to learn a model that provides the input to a clustering process in that feature space. Specifically, the model learns to predict each cluster’s centroid and radius.
Inspired by region proposals, our model also outputs an image in which each pixel contains the probability that it is a cluster centroid. To generate the ground truth for this image, we sort the pixels representing each object in the RGB image by their pixel distance to the object’s centroid in ascending order. The top 10%-30% of the ground truth pixels per object are annotated as cluster centroid candidates with value 1. The rest of the pixels are annotated as 0.
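The centroid-candidate annotation can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `centroid_candidate_labels` is hypothetical, the centroid is assumed to be given in pixel coordinates, and the fraction of 0.2 is simply one value inside the 10%-30% range mentioned above.

```python
import math
from typing import Dict, Sequence, Tuple

Pixel = Tuple[int, int]


def centroid_candidate_labels(
    pixels: Sequence[Pixel],
    centroid: Tuple[float, float],
    fraction: float = 0.2,  # assumed value within the 10%-30% range
) -> Dict[Pixel, int]:
    """Label the pixels of one object: 1 for the closest `fraction`, else 0."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by pixel distance to the object's centroid, ascending.
    ranked = sorted(
        pixels,
        key=lambda p: math.hypot(p[0] - centroid[0], p[1] - centroid[1]),
    )
    # Keep at least one candidate so that small objects still get a seed.
    keep = max(1, int(round(fraction * len(ranked))))
    positives = set(ranked[:keep])
    return {p: (1 if p in positives else 0) for p in pixels}


# A 10-pixel object lying on one image row, centroid near u = 4.6.
row = [(u, 0) for u in range(10)]
labels = centroid_candidate_labels(row, centroid=(4.6, 0.0))
```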
Our model produces an additional output image in which each pixel (u,v) contains a scalar value. This value is a radius estimate of the sphere in feature space that encloses the features of all pixels which belong to the same object. The sphere is centered at the object feature predicted at (u,v). Any pixel whose predicted feature falls inside the sphere will be segmented as the same object. Any pixel whose predicted feature falls outside the sphere will be segmented as a different object. To generate the ground truth, each pixel (u,v) representing an object is annotated with half of the minimum distance between that object’s feature and the features of all the other objects in the image.
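The radius annotation described above, half of the minimum feature distance to any other object, can be sketched in Python as follows. The function name `radius_ground_truth` and the dictionary layout are hypothetical, the distance is assumed to be Euclidean, and returning infinity for a scene with a single object is an assumed convention that the text does not specify.

```python
import math
from typing import Dict, Tuple

Feature = Tuple[float, ...]


def radius_ground_truth(features: Dict[int, Feature]) -> Dict[int, float]:
    """Map each object id to half the distance to the nearest other feature."""
    radii = {}
    for k, f_k in features.items():
        others = [
            math.dist(f_k, f_j) for j, f_j in features.items() if j != k
        ]
        # Assumed convention: a lone object has no neighbor, so no bound.
        radii[k] = 0.5 * min(others) if others else math.inf
    return radii


# Three objects described only by their 3D centers, for brevity.
object_features = {
    1: (0.0, 0.0, 1.0),
    2: (0.6, 0.0, 1.0),
    3: (0.0, 2.0, 1.0),
}
radii = radius_ground_truth(object_features)
```

Because each radius is half the distance to the nearest neighbor, spheres of two different objects cannot overlap in feature space when measured with the ground truth features.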
Given the predicted and , we can now perform multi-object segmentation as visualized in Fig. 3. There are two stages of clustering. For initialization, pixel (u,v) with the maximum probability of being an object centroid is chosen first. Given a sphere centered at with radius , all pixels with a feature enclosed by this sphere are assigned to object . All pixel assigned to are removed from the set of unsegmented pixels before segmenting the next object. From the remaining pixels, the one with the highest is used as the seed for segmenting . This process is repeated until all foreground pixels are assigned to an object. After this initialization, we refine the resulting segmentation running one iteration of a Gaussian Mixture Model (GMM). The number of mixture components equals the number of object clusters. The covariance matrix of each mixture component equals the covariance matrix of the predicted object feature values in each cluster.
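The initialization stage of the clustering can be sketched as a greedy loop. This is an illustration under assumptions: the function name `sequential_clustering` and the dictionary-based inputs are hypothetical, distances are Euclidean, and the subsequent GMM refinement step is omitted.

```python
import math
from typing import Dict, Tuple

Pixel = Tuple[int, int]
Feature = Tuple[float, ...]


def sequential_clustering(
    features: Dict[Pixel, Feature],
    center_prob: Dict[Pixel, float],
    radius: Dict[Pixel, float],
) -> Dict[Pixel, int]:
    """Greedily assign foreground pixels to instances; returns pixel -> index."""
    unassigned = set(features)
    labels: Dict[Pixel, int] = {}
    instance = 0
    while unassigned:
        # Seed: the remaining pixel most likely to be a cluster centroid.
        # Sorting first makes tie-breaking deterministic.
        seed = max(sorted(unassigned), key=lambda p: center_prob[p])
        # All remaining pixels whose feature lies inside the seed's sphere.
        members = {
            p
            for p in unassigned
            if math.dist(features[p], features[seed]) <= radius[seed]
        }
        members.add(seed)  # guards against a negative predicted radius
        for p in members:
            labels[p] = instance
        unassigned -= members
        instance += 1
    return labels


pred_features = {
    (0, 0): (0.0, 0.0, 1.0),
    (0, 1): (0.1, 0.0, 1.0),
    (5, 5): (2.0, 0.0, 1.0),
    (5, 6): (2.1, 0.0, 1.0),
}
pred_prob = {(0, 0): 0.9, (0, 1): 0.2, (5, 5): 0.8, (5, 6): 0.3}
pred_radius = {p: 0.5 for p in pred_features}
segmentation = sequential_clustering(pred_features, pred_prob, pred_radius)
```

The number of instances is never supplied; it falls out of how many seeds are needed before every foreground pixel is assigned.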
III-C ClusterNet Architecture
Our proposed model is shown in Fig. 4. It takes three images as input: RGB, XYZ and depth. XYZ images are transformed from depth images using the camera intrinsics and therefore seem like redundant information. However, we will show in the experimental section that there is a gain in including both. RGB and XYZ are fed into a pre-trained ResNet50  encoder.
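For readers unfamiliar with the depth-to-XYZ transformation, here is a minimal sketch assuming a standard pinhole camera model without lens distortion. The function name `depth_to_xyz` and the intrinsic values (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions, not the parameters of the camera used in this work.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_xyz(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[Point]]:
    """Back-project a depth image into a per-pixel XYZ image (camera frame)."""
    xyz = []
    for v, row in enumerate(depth):
        xyz_row = []
        for u, z in enumerate(row):
            # Pinhole model: a pixel offset from the principal point scales
            # with depth and inversely with focal length.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            xyz_row.append((x, y, z))
        xyz.append(xyz_row)
    return xyz


# A flat surface 2 m away seen by a tiny 3 x 3 camera (toy intrinsics).
flat_depth = [[2.0, 2.0, 2.0] for _ in range(3)]
xyz_image = depth_to_xyz(flat_depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```

The Z channel of the result is the depth image itself, which is why the two inputs can look redundant at first glance.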
We extract their level features with size of and level features with size of denoted as , and , respectively. Then and are concatenated and fed into another convolution layer to fuse them. The output is denoted as . and are concatenated to and fed into an Atrous Spatial Pyramid Pooling (ASPP) module  to enlarge the receptive field of each cell within of the feature map. The output is up-sampled to denoted as to have the same size as . Depth images are fed into a VGG architecture  and fed into another ASPP module  denoted as . Then , and are concatenated as the final encoding stage .
Intuitively, the cluster radius could be inferred from global information like distances and poses between nearby objects. Therefore, is directly decoded from . Cluster centroid probability images , mask images and object feature images are decoded from . For cluster centroid probability images and mask images, the neural net model decodes them to ,
respectively and then up-samples them by 4 using bilinear interpolation. For object feature images the neural network model decodes them directly to the original size.
III-D Loss Function
We define the following object-centric training loss:
where the s weight each loss term. Note that all pixel-wise loss terms , , and are computed only on the ground truth foreground pixels.
III-D1 Semantic Mask Loss
The semantic mask loss is the cross-entropy loss between the ground truth and estimated semantic segmentation. In our experiments, there are only 2 classes (foreground/background). If a pixel is the projection of an object point, we assign 1 as ground truth; otherwise 0.
III-D2 Cluster Center Loss
Cross-entropy loss is used to learn the probability of a pixel to be the object center.
III-D3 Pixel-wise Loss
We use a pixel-wise loss on the object feature and the enclosing sphere radius . For each attribute, we use the L2-norm to measure and minimize the error between predictions and ground truth. Note that the loss on each attribute is also differently weighted. We denote their corresponding weights and , respectively.
III-D4 Variance Loss
We use the variance loss to encourage pixels belonging to the same object to have similar object features and thereby to reduce their variance.
where is the mean value of over all pixels belonging to .
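One plausible form of such a variance term is sketched below. The original equation is not recoverable from the text here, so this is an assumption: the loss is taken as the mean squared deviation of the predicted features from their per-object mean, averaged over objects. The function name `variance_loss` and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Feature = Tuple[float, ...]


def variance_loss(
    pred: Dict[Pixel, Feature], instance: Dict[Pixel, int]
) -> float:
    """Average, over objects, of the mean squared deviation from the mean."""
    groups: Dict[int, List[Feature]] = {}
    for pixel, obj_id in instance.items():
        groups.setdefault(obj_id, []).append(pred[pixel])
    if not groups:
        return 0.0
    total = 0.0
    for feats in groups.values():
        dim = len(feats[0])
        # Mean predicted feature over all pixels of this object.
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        total += sum(
            sum((f[d] - mean[d]) ** 2 for d in range(dim)) for f in feats
        ) / len(feats)
    return total / len(groups)


pred_feats = {(0, 0): (0.0, 0.0), (0, 1): (2.0, 0.0),
              (3, 3): (5.0, 5.0), (3, 4): (5.0, 5.0)}
gt_instance = {(0, 0): 1, (0, 1): 1, (3, 3): 2, (3, 4): 2}
var_loss = variance_loss(pred_feats, gt_instance)
```

Note that the term only compares predictions with each other, not with the ground truth, so it tightens clusters without fixing where they sit.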
III-D5 Violation Loss
penalizes pixels that are not correctly segmented. Any predicted feature that is more than away from the ground truth will be pushed towards the ground truth feature by the violation loss:
is a hyperparameter to define the range of violation. In our experiments we foundto work well.
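A hinge-style penalty is one way to realize the behavior described for the violation loss. The sketch below is an assumption rather than the original formulation: the function name `violation_loss` is hypothetical, the distance is Euclidean, and `threshold` stands in for the violation range, whose exact definition and value are not recoverable from the text here.

```python
import math
from typing import Dict, Tuple

Pixel = Tuple[int, int]
Feature = Tuple[float, ...]


def violation_loss(
    pred: Dict[Pixel, Feature],
    target: Dict[Pixel, Feature],
    threshold: float,  # placeholder for the violation range
) -> float:
    """Mean hinge penalty on predictions farther than `threshold` from truth."""
    if not pred:
        return 0.0
    total = 0.0
    for pixel, feature in pred.items():
        excess = math.dist(feature, target[pixel]) - threshold
        # Predictions inside the allowed range contribute nothing.
        total += max(0.0, excess)
    return total / len(pred)


pred_f = {(0, 0): (0.0, 0.0, 1.0), (0, 1): (3.0, 4.0, 1.0)}
true_f = {(0, 0): (0.0, 0.0, 1.0), (0, 1): (0.0, 0.0, 1.0)}
vio_loss = violation_loss(pred_f, true_f, threshold=1.0)
```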
IV-A Data Set
For evaluation, we use the synthetic dataset by Shao et al. . It contains RGB-D images of scenes with a large variety of rigid objects. This dataset is very relevant to robotic manipulation research as it contains a wide variety of graspable objects and is recorded with an RGB-D camera that is very common on manipulation robots. Objects are also strongly occluding each other such that this dataset exposes limitations of current state-of-the-art approaches based on region-proposals.
The dataset is generated from 31594 3D object mesh models from ShapeNet  covering 28 categories. The models are split into a training, validation and test set with 21899, 3186 and 6509 objects respectively. For each scene, 1-30 object models are randomly selected. 49988, 6720, 14372 images are synthesized using models from training, validation and test sets respectively.
IV-B Evaluation Metrics
We adopt the standard evaluation metrics like average precision (AP) and average recall (AR) as used for the COCO dataset. The average precision is measured based on intersection over union (IoU) between predicted segmentation and ground truth segmentation. and is the average precision based on an IoU threshold of and respectively. is the average of where x ranges from 0.5 to 0.95 in steps of . The average recall is measured based on the maximum object segmentation candidates allowed. , and is calculated based on , and maximum object segmentation candidates allowed. Objects are classified to be small, medium and large objects if their bounding box areas are within the range of , and , respectively (units are pixels). AP and AR calculated on small, medium and large objects are denoted as , , and , and . We also define an object occlusion score. It constitutes the number of pixels showing the occluded object divided by the number of object pixels if the object was not occluded. Objects are classified to be under heavy, medium and little occlusion if their occlusion score is within the range of , and , respectively. AR calculated on objects under heavy, medium and little occlusion is denoted as , and .
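The two per-object quantities underlying these metrics can be sketched as follows. This is a minimal illustration with hypothetical function names (`mask_iou`, `occlusion_score`); masks are assumed to be given as sets of pixel coordinates, and the full AP and AR computation over thresholds and candidate counts is not reproduced.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int]


def mask_iou(pred: Iterable[Pixel], truth: Iterable[Pixel]) -> float:
    """Intersection over union of two masks given as pixel collections."""
    a, b = set(pred), set(truth)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def occlusion_score(visible_pixels: int, unoccluded_pixels: int) -> float:
    """Visible pixel count divided by the count if the object were unoccluded."""
    if unoccluded_pixels <= 0:
        raise ValueError("unoccluded pixel count must be positive")
    return visible_pixels / unoccluded_pixels


pred_mask = {(0, 0), (0, 1), (1, 0)}
true_mask = {(0, 1), (1, 0), (1, 1)}
iou = mask_iou(pred_mask, true_mask)
score = occlusion_score(visible_pixels=30, unoccluded_pixels=120)
```

With this definition, a lower occlusion score means a smaller visible fraction of the object, i.e. heavier occlusion.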
We compare our method with Mask R-CNN, which only uses RGB images as input. As backbone, we choose ResNet-50-C4, denoted as Mask R-CNN(C4), and ResNet-50-FPN, denoted as Mask R-CNN(FPN). We use the Detectron implementation and its default parameters. The experiments are run on two Nvidia P100 GPUs with batch size 2 per GPU for 200000 iterations. We change the number of classes to 2, containing only foreground and background, and use the RLE format as the ground truth annotation format.
IV-D Quantitative Evaluations of Instance Segmentation
We refer to the network in Fig. 4 as Our(c+cov), where c+cov represents using both the object center and the object covariance matrix as the object feature. We also compare to a variant of our proposed architecture, Our(c), where the object feature only contains the 3D centroid. This provides insight into the impact of using the second order moments as object features. Furthermore, we run three ablation studies to demonstrate the impact of different input image modalities. Our(rgbdc) denotes the neural network that uses only RGB and depth images as input. Our(rgbc) denotes the model that uses only RGB images as input. Our(rgbxyzc) uses both RGB and point clouds as inputs. Our(rgbdc), Our(rgbc) and Our(rgbxyzc) all use only the object centroid as the object feature.
For training, we use the Adam optimizer  with its suggested default parameters of and along with a learning rate . We use a batch size of 4 image pairs. The input RGB-D images have a resolution of . The loss weights, as defined in Sec. III-D, are set to , , , , =100.0, and . The predicted object features
become more accurate after epoch 5. We then adjust the corresponding loss weights to increase the impact of the variance and violation loss. The results in terms of instance segmentation accuracy are shown in Table I. Our proposed models Our(c) and Our(c+cov) outperform the aforementioned Mask R-CNN approaches by a large margin. The results in terms of segmentation accuracy under different levels of occlusion are reported in Table II. A few example results are shown in Fig. 5. Compared to the Mask R-CNN baselines, our models improve the average recall of objects under all levels of occlusion.
IV-D1 Effects of using different input image modalities
We report the instance segmentation performance in Table I. Our(rgbc) has the lowest performance of all models, including the Mask R-CNN baselines that only take RGB images as input. This is because the 3D centroid of an object is hard to reconstruct from RGB images. Including depth data (Our(rgbdc)) already improves the performance by a large margin compared with Our(rgbc). Including point cloud information (Our(rgbxyzc)) improves the instance segmentation performance further and outperforms Mask R-CNN significantly. This indicates that point clouds provide strong cues about 3D object instances and that our model leverages these cues well. Our(c) relies on both depth images and point clouds. Compared with Our(rgbxyzc), Our(c) achieves improved results, especially on large objects. This suggests that depth images provide useful cues for segmenting large objects, probably due to the large variance of such objects in the X,Y channels of the point clouds.
IV-D2 Effects of using different object features
Our(c) shows better performance than Our(c+cov). Predicting second order moments is more difficult than predicting the object centroid, as suggested by the high variance of errors when predicting this feature. Additionally, there are only a few objects in the dataset that exhibit situations such as those shown in the middle and right of Fig. 2.
IV-E Qualitative Evaluation on Real World Data
We demonstrate the network’s ability to perform instance segmentation in a real world setting. We recorded real RGB-D data with the Intel RealSense SR300 Camera. It was captured using a diverse set of objects with varying geometries, textures, and colors. Note that we do not have any ground truth annotations.
We compare the performance of Mask R-CNN(C4) and our method Our(c). Note that neither model is fine-tuned to transfer from synthetic to real data. The qualitative results are shown in Fig. 6. Our method generalizes well and is able to accurately segment object instances. In contrast, Mask R-CNN(C4) segments the whole image as one instance and fails to detect and segment any individual objects, indicating poor generalization ability.
We proposed ClusterNet, a model for 3D instance segmentation of an unknown number of objects in RGB-D images. In our approach, instance segmentation is formulated as a regression problem towards 3D object features, followed by clustering. Specifically, the model makes pixel-wise predictions of the first and second order moments of the object that the pixel belongs to. Then, sequential clustering is performed in this feature space to infer the object instances. Through this formulation, we showed how our model can leverage RGB and depth data to achieve robust 3D instance segmentation in challenging, cluttered scenes with heavy occlusions. While the quantitative results were obtained on synthetic data, we also showed how our model transfers to similarly challenging real world data. As future work, we aim to evaluate this model on datasets for autonomous driving such as Cityscapes or ApolloScape. We would also like to integrate this approach with semantic segmentation, which is currently restricted to foreground/background segmentation.
- Chang et al.  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical report, 2015.
- Chen et al.  Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
- Cordts et al. [2016a] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In
- Cordts et al. [2016b] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016b.
- De Brabandere et al.  Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. In Deep Learning for Robotic Vision, workshop at CVPR 2017, pages 1–2. CVPR, 2017.
- Girshick et al.  Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
- Gupta et al.  Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
- Han et al.  Jungong Han, Ling Shao, Dong Xu, and Jamie Shotton. Enhanced computer vision with microsoft kinect sensor: A review. IEEE transactions on cybernetics, 43(5):1318–1334, 2013.
- Hariharan et al.  Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
- Huang et al.  Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. CoRR, abs/1803.06184, 2018.
- Kingma and Ba  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int. Conf. on Learning Representations (ICLR), 2015.
- Liang et al.  Xiaodan Liang, Liang Lin, Yunchao Wei, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Proposal-free network for instance-level semantic object segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Lin et al.  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- Newell et al.  Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2277–2287, 2017.
- Ng et al.  Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
- Ren et al.  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 2015.
- Schroff et al.  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- Shao et al.  Lin Shao, Parth Shah, Vikranth Dwaracherla, and Jeannette Bohg. Motion-based object segmentation based on dense rgb-d scene flow. arXiv preprint arXiv:1804.05195, 2018.
- Simonyan and Zisserman  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.