Many biomedical applications, such as phenotyping [1] and tracking [2], rely on instance segmentation, which aims not only to group pixels into semantic categories but also to separate individual objects of the same category. This task is challenging because objects of the same class can be crowded together without obvious boundary cues.
A prevalent class of approaches used for biomedical images is based on semantic segmentation, obtaining instances through per-pixel classification [3, 4]. Although this approach generates good object coverage, crowded objects are often mistakenly regarded as one connected region. DCAN [4] predicts the object contour explicitly to separate touching glands. However, segmentation by contours is unreliable in many cases, since a few misclassified pixels can break a continuous boundary.
Another major class of approaches, such as Mask-RCNN [7], refines the bounding boxes obtained from object detection methods [5, 6]. Object detection methods rely on non-maximum suppression (NMS) to remove duplicate predictions resulting from exhaustive search. This becomes problematic when the bounding boxes of two objects overlap with a large ratio: one valid object will be suppressed. A finer shape representation, star-convex polygons, is used by [8] with the intention of reducing false suppression. However, it is only suitable for roundish objects [8, 9].
In this work, we propose to obtain instances by grouping pixels based on an object-aware embedding. A deep neural network is trained to assign each pixel an embedding vector. Pixels of the same object will have similar directions in the embedding space, while the embeddings of spatially close objects are orthogonal to each other. Since our method performs pixel-level grouping, it is not affected by object shape and does not suffer from the false suppression problem. At the same time, unlike semantic-segmentation-based methods, it avoids fusing adjacent objects.
Some recent research [12, 10, 13] proposes the use of embedding vectors to distinguish individual objects in driving scenes and natural images. These approaches force each object to occupy a different part of the embedding space. This global constraint is not necessary, and could even be detrimental, for biomedical images that often contain repeated local patterns. For example, the contents of the receptive fields of pixels X and Y (Fig. 1(a)) are very similar, both with one object above and one below. The network has no clear cue for assigning X and Y different embeddings, so forcing them to be different is likely to hinder training. Furthermore, the global constraint is inefficient in terms of embedding space utilization. There is no risk of distant objects being merged, so they could share the same part of the embedding space, such as B and D in Fig. 1.
The main contributions of our work are as follows: (1) we propose to train the embedding mapping while only constraining adjacent objects to be different, (2) we introduce a novel loss with a clear geometric interpretation (adjacent instances live in orthogonal spaces), and (3) we present a multi-task network head for embedding training and for obtaining segmentations from embeddings, which can be applied to any backbone network.
Our method is compared with several strong competing approaches. It yields comparable or better results on two data sets: a combined fluorescence microscopy data set of BBBC006 (https://data.broadinstitute.org/bbbc/BBBC006/) and the part of DSB2018 (https://www.kaggle.com/c/data-science-bowl-2018) used by [8], and the CVPPP2017 (https://www.plant-phenotyping.org/CVPPP2017-challenge) leaf segmentation data set.
Our approach has two output branches taking the same feature map as input: the embedding branch and the distance regression branch (Fig. 2). Both consist of two convolutional layers. The last layer of the embedding branch uses linear activation and each filter outputs one dimension of the embedding vector.
The distance regression branch has a single-channel output with ReLU activation. We regress the distance from an object pixel to the closest boundary pixel (normalized within each object). The distance map is used to help obtain segmentations from the embedding map; details are given in Section 2.2.
The background is treated as a standalone object that is adjacent to all other objects. For distance regression, background pixels are set to zero. It is worth mentioning that the distance map alone provides enough cues to separate objects. However, we argue that it is not optimal for obtaining accurate segmentations, since both object and background pixels have small values around the boundaries, which makes the boundary ambiguous and sensitive to small perturbations. In this work, the distance regression plays the role of roughly locating the objects.
2.1 Loss function
The training loss consists of two parts, a regression loss and an embedding loss, which supervise the learning of the distance regression branch and the embedding branch, respectively. The embedding loss is given a larger weight to put more emphasis on the embedding training.
We minimize the mean squared error for the distance regression, with each pixel weighted to balance the foreground and background frequency.
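As an illustration of the weighted regression loss, the following is a minimal NumPy sketch. The function name `weighted_mse` is hypothetical, and the weighting scheme (each class weighted inversely to its pixel count so that foreground and background contribute equally) is an assumption; the text only states that pixels are weighted to balance the two frequencies.

```python
import numpy as np

def weighted_mse(pred, target, foreground):
    """Squared error with per-pixel weights balancing foreground and background.

    Assumption: each class is weighted inversely to its pixel count, so both
    classes contribute equally to the loss; the original scheme may differ.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    fg = np.asarray(foreground, dtype=bool)
    n_fg = max(int(fg.sum()), 1)          # guard against empty classes
    n_bg = max(int((~fg).sum()), 1)
    weights = np.where(fg, 0.5 / n_fg, 0.5 / n_bg)
    return float(np.sum(weights * (pred - target) ** 2))

# Toy distance map: three background pixels and one object pixel.
target = np.array([[0.0, 0.0, 0.0, 1.0]])
pred = np.zeros_like(target)              # the object pixel is missed entirely
loss = weighted_mse(pred, target, target > 0)
```

With unweighted averaging the missed object pixel would contribute only a quarter of its squared error, whereas the balanced weights keep rare foreground pixels from being drowned out by the background.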
Intuitively, embeddings of the same object should end up at similar positions in the embedding space, while different objects should be discriminable. The embedding loss is therefore formulated as the sum of two terms: the consistency term and the discriminative term.
To give a specific formula, we have to determine how "similarity" is measured. While Euclidean distance is used by many works [10, 11], we construct the loss with cosine distance, which decouples the loss from the output range of different networks: $D(e_i, e_j) = 1 - \frac{e_i^T e_j}{\|e_i\|_2 \|e_j\|_2}$, where $e_i$ and $e_j$ are the embeddings of pixels $i$ and $j$. The cosine distance ranges from 0, meaning exactly the same direction, to 2, meaning the opposite direction, with 1 indicating orthogonality.
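The stated range of the cosine distance (0 for identical directions, 1 for orthogonality, 2 for opposite directions) can be checked with a short NumPy sketch. The function name `cosine_distance` and the small `eps` guard against zero-length vectors are our own additions for illustration.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """Cosine distance 1 - cos(angle) between vectors along the last axis."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - dot / np.maximum(norms, eps)

# The vector lengths differ on purpose: only the direction matters.
same = cosine_distance([2.0, 0.0], [5.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine_distance([1.0, 0.0], [-4.0, 0.0])   # opposite direction
```

Because the vectors are normalized inside the distance, rescaling an embedding leaves the result unchanged, which is the sense in which the measure is independent of the network's output range.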
Instead of pushing each object pair as far apart as possible [10, 13, 11] in the embedding space (global constraint), we only push adjacent objects into each other's orthogonal space (local constraint). As shown in Fig. 1, far away objects can occupy the same position in the embedding space, which uses the space more effectively. In the embedding map in Fig. 2, only a few colors appear repeatedly, while adjacent objects are still guaranteed to have different colors.
Suppose an image contains K objects, each with its own set of pixels. The loss is defined in terms of the following quantities: the regression output and ground truth of pixel p, the embedding of pixel p, the (normalized) mean embedding of object k, a per-pixel factor for balancing the foreground and background frequency, and the set of neighbors of object k. An object is considered a neighbor of object k if its shortest distance to object k is below a distance threshold.
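To make the two embedding terms concrete, here is a minimal NumPy sketch of one plausible form that is consistent with the description above. It is an assumption, not the exact formula of the paper: the consistency term is taken as the cosine distance between each pixel embedding and the mean embedding of its own object, and the discriminative term as the absolute cosine similarity to the mean embeddings of neighboring objects, which vanishes at orthogonality. The per-pixel balancing weights are omitted, and the function name `embedding_loss` and the `neighbors` dictionary are hypothetical.

```python
import numpy as np

def embedding_loss(emb, labels, neighbors, eps=1e-12):
    """Consistency and discriminative terms with a neighbor-only constraint.

    emb:       (H, W, C) array of pixel embeddings
    labels:    (H, W) int array, one id per object (background counts as one)
    neighbors: dict mapping an object id to the ids of its adjacent objects
    """
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.maximum(np.linalg.norm(emb, axis=-1, keepdims=True), eps)
    ids = [int(k) for k in np.unique(labels)]

    # Normalized mean embedding of every object.
    means = {}
    for k in ids:
        m = unit[labels == k].mean(axis=0)
        means[k] = m / max(np.linalg.norm(m), eps)

    consistency, discriminative = 0.0, 0.0
    for k in ids:
        pix = unit[labels == k]                      # (n_k, C) unit vectors
        # Pull pixels towards the mean direction of their own object.
        consistency += np.mean(1.0 - pix @ means[k])
        # Push pixels towards orthogonality with each neighboring object.
        nbrs = neighbors.get(k, ())
        for n in nbrs:
            discriminative += np.mean(np.abs(pix @ means[n])) / len(nbrs)
    return consistency / len(ids), discriminative / len(ids)

# Two adjacent objects with orthogonal embeddings: both terms are zero.
labels = np.array([[1, 1, 2, 2]])
emb = np.zeros((1, 4, 2))
emb[0, :2] = [1.0, 0.0]
emb[0, 2:] = [0.0, 1.0]
con, dis = embedding_loss(emb, labels, {1: [2], 2: [1]})
```

Objects that are not listed as neighbors never appear in the discriminative term, so distant objects are free to share the same direction in the embedding space.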
Since objects form clusters in the embedding space, a clustering method that does not require specifying the number of clusters (e.g. mean shift [14]) can be employed to obtain segmentations from the embedding. However, due to the time complexity of mean shift, even processing medium-size images takes tens of seconds. Since our embedding space has a clear geometric interpretation, we propose a simple but effective way to obtain segmentations:
1. Threshold the distance map to obtain the central (seed) region of each object. A fixed threshold is used in our experiments.
2. Compute the mean embedding of each seed region.
3. Iteratively perform morphological dilation with a 3x3 kernel. A frontier pixel is added to the object if it is not yet assigned to another object and the cosine distance between its embedding and the mean embedding of the object is smaller than the embedding threshold.
4. Stop when no new pixels are included.
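The embedding threshold is determined based on the fact that a pixel embedding should be closer, in terms of angle, to the mean embedding of its ground truth object than to that of any other object. Since adjacent objects are orthogonal, we set the midpoint as the boundary, i.e. an angle of 45°, which corresponds to a cosine distance of 1 − cos(45°) ≈ 0.29.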
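The growing procedure described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the original implementation: the seed regions are assumed to be given as an already labeled map (thresholding the distance map and labeling connected components is left out), the background is not treated specially, and the default threshold of 1 − cos(45°) ≈ 0.29 is our reading of the midpoint argument for orthogonal neighbors. The names `grow_seeds` and `t_emb` are hypothetical.

```python
import numpy as np

def grow_seeds(emb, seeds, t_emb=1.0 - np.cos(np.pi / 4)):
    """Grow labeled seed regions by iterative 3x3 dilation in embedding space.

    emb:   (H, W, C) array of pixel embeddings
    seeds: (H, W) int array, 0 = unassigned, k > 0 = seed region of object k
    t_emb: cosine-distance threshold for accepting a frontier pixel
    """
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.maximum(np.linalg.norm(emb, axis=-1, keepdims=True), 1e-12)
    labels = np.array(seeds, dtype=int)
    ids = [int(k) for k in np.unique(labels) if k > 0]

    # Normalized mean embedding of each seed region.
    means = {}
    for k in ids:
        m = unit[labels == k].mean(axis=0)
        means[k] = m / max(np.linalg.norm(m), 1e-12)

    h, w = labels.shape
    changed = True
    while changed:                       # stop when no new pixels are included
        changed = False
        for k in ids:
            # 3x3 dilation of the current mask, built from shifted copies.
            padded = np.pad(labels == k, 1)
            dilated = np.zeros((h, w), dtype=bool)
            for dy in range(3):
                for dx in range(3):
                    dilated |= padded[dy:dy + h, dx:dx + w]
            frontier = dilated & (labels == 0)       # unassigned pixels only
            dist = 1.0 - unit @ means[k]             # cosine distance to mean k
            accept = frontier & (dist < t_emb)
            if accept.any():
                labels[accept] = k
                changed = True
    return labels

# Two objects side by side with orthogonal embeddings and one seed pixel each.
emb = np.zeros((1, 6, 2))
emb[0, :3] = [1.0, 0.0]
emb[0, 3:] = [0.0, 1.0]
seeds = np.array([[1, 0, 0, 0, 0, 2]])
segmentation = grow_seeds(emb, seeds)
```

In this toy case each seed grows until it reaches pixels whose embedding is orthogonal to its mean, so the two objects meet at the true boundary without merging.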
3.1 Data sets and evaluation metrics
In order to compare different methods, we chose two data sets that reflect typical phenomena in biomedical images:
BBBC006+partDSB2018: We combined the fluorescence microscopy images of cells used by [8] (part of DSB2018) and BBBC006. BBBC006 is a larger data set containing more densely distributed cells. We removed a small number of images without objects or with obvious labeling mistakes. The data were randomly split into 1003 training images and 230 test images. The evaluation metric was the average precision (AP) over a range of IoU (intersection over union) thresholds from 0.5 to 0.95, as defined for DSB2018.
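For reference, the AP metric can be sketched as below. The sketch follows the DSB2018 definition as we understand it, which is an assumption here since the text does not spell it out: at each IoU threshold the score is TP / (TP + FP + FN), a prediction counts as a hit if its IoU with a ground truth object exceeds the threshold, and the thresholds run from 0.5 to 0.95 in steps of 0.05. The function name `average_precision` is hypothetical.

```python
import numpy as np

def average_precision(gt, pred, thresholds=np.linspace(0.5, 0.95, 10)):
    """Mean of TP / (TP + FP + FN) over IoU thresholds; label 0 is background."""
    gt = np.asarray(gt)
    pred = np.asarray(pred)
    gt_ids = [k for k in np.unique(gt) if k != 0]
    pred_ids = [k for k in np.unique(pred) if k != 0]

    # Pairwise IoU between every ground truth and every predicted object.
    iou = np.zeros((len(gt_ids), len(pred_ids)))
    for i, g in enumerate(gt_ids):
        for j, p in enumerate(pred_ids):
            inter = np.sum((gt == g) & (pred == p))
            union = np.sum((gt == g) | (pred == p))
            iou[i, j] = inter / union

    scores = []
    for t in thresholds:
        # Above IoU 0.5 an object can match at most one counterpart,
        # so counting the entries above the threshold gives the hits.
        tp = int(np.sum(iou > t))
        fp = len(pred_ids) - tp
        fn = len(gt_ids) - tp
        denom = tp + fp + fn
        scores.append(tp / denom if denom else 1.0)
    return float(np.mean(scores))

# Object 1 is matched perfectly, object 2 with IoU 2/3.
gt = np.array([[1, 1, 0, 2, 2, 2]])
pred = np.array([[1, 1, 0, 2, 2, 0]])
ap = average_precision(gt, pred)
```

In the toy example the second object counts as a hit only up to the 0.65 threshold, which illustrates why imprecise boundaries are penalized mainly in the high IoU range.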
CVPPP2017: Compared to the roundish cells, the leaves in CVPPP2017 have more complex shapes and exhibit more overlap or contact. We randomly sampled 648 images for training and 162 images for testing. The results were evaluated in terms of symmetric best dice (SBD), foreground-background dice (FBD), difference in count (DiC) and absolute DiC [1].
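As a reminder of how SBD works, the sketch below follows the usual definition from the leaf segmentation literature, stated here as an assumption: the best dice of A against B averages, over the objects in A, the highest Dice score achieved by any object in B, and SBD is the minimum of the two directions. The function names and the handling of empty label maps (score 0) are our own choices.

```python
import numpy as np

def best_dice(a, b):
    """Average over objects in `a` of the best Dice score against objects in `b`."""
    a = np.asarray(a)
    b = np.asarray(b)
    a_ids = [k for k in np.unique(a) if k != 0]
    b_ids = [k for k in np.unique(b) if k != 0]
    if not a_ids:
        return 0.0                      # assumption: no objects scores zero
    scores = []
    for i in a_ids:
        mask_a = a == i
        best = 0.0
        for j in b_ids:
            mask_b = b == j
            dice = 2.0 * np.sum(mask_a & mask_b) / (mask_a.sum() + mask_b.sum())
            best = max(best, dice)
        scores.append(best)
    return float(np.mean(scores))

def symmetric_best_dice(gt, pred):
    """Minimum of the best dice computed in both directions."""
    return min(best_dice(gt, pred), best_dice(pred, gt))

# The boundary between the two leaves is predicted one pixel too far right.
gt = np.array([[1, 1, 2, 2]])
pred = np.array([[1, 1, 1, 2]])
sbd = symmetric_best_dice(gt, pred)
```

Taking the minimum of both directions penalizes over-segmentation and under-segmentation alike, since either one lowers the score in at least one direction.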
3.2 Competing methods
Unet: We employed the widely used Unet [3] to perform 3-label segmentation (object, contour, background). Since many objects are in contact, we introduced a 2-pixel boundary to separate them.
Mask-RCNN: Mask-RCNN [7] localizes objects by proposal classification and non-maximum suppression (NMS). Afterwards, segmentation is performed on each object bounding box. We generated 1000 proposals with anchor scales (8, 16, 32, 64, 128) for the cell data set and 50 proposals with scales (16, 32, 64, 128, 256) for the leaf data set. The NMS threshold was set to 0.9 for both data sets.
Stardist: Star-convex polygons are used by [8] as a finer shape representation. Without an explicit segmentation step, the final segmentation is obtained by combining distances from center to boundary in 32 radial directions. The final step of Stardist consists of NMS to suppress overlapping polygons.
For comparability, all methods except Mask-RCNN used a simplified U-net [3] (3 pooling and 3 upsampling steps) as the backbone network and were trained from scratch. Mask-RCNN (ResNet-101 [15] backbone) was fine-tuned from a model pretrained on the MS COCO data set (http://cocodataset.org/#home).
3.3 Results and discussion
The Unet had the lowest mean AP in Tab. 1. Its AP decreased rapidly with increasing IoU threshold because of the false fusion of adjacent cells. Both Stardist and Mask-RCNN can handle most adjacent objects, but when a few cells form a tight roundish cluster, both methods are likely to fail. Mask-RCNN yielded the best score in the high IoU range, which is the benefit of an explicit segmentation step: masks are better aligned with the object boundary. Qualitative results in Fig. 3 show that our method is better at distinguishing objects that are in contact. This is also reflected by the highest AP of our method at lower IoU thresholds.
The leaf segmentation results better reflect the characteristics of each approach. As shown in Fig. 3, the Unet outlines the leaves accurately, but merges several instances into one (green and yellow). All other approaches proved to be object-aware. However, Mask-RCNN missed leaf B, because the bounding box of B is almost identical to that of A. Stardist avoids such false suppression by using a better shape representation, which comes at the expense of losing finer structures, such as the petioles. This is easy to understand, since Stardist obtains a mask by fitting a polygon based on discrete radial directions. In contrast, our method not only avoids misses, but also produces a good contour.
Local vs. global constraint: To demonstrate the effect of the local constraint, we tested the method with different neighbor distance thresholds: a larger threshold treats more objects as neighbors (a large enough threshold is equivalent to the global constraint). The best result is always achieved with a small threshold, which only takes objects in contact or almost in contact as neighbors. With an embedding dimension of 4, the performance on the cell data set drops markedly once the threshold becomes large, due to the inefficient use of the embedding space. A similar drop occurs on the leaf segmentation data set.
Incomplete object mask: Inconsistent embeddings within an object (Fig. 4) sometimes occur near the boundary, leading to incomplete segmentations. This is why our method does not perform as well as Mask-RCNN in the high IoU range. The reason for this inconsistency deserves further study.
4 Conclusion and outlook
Our proposed approach not only outlines objects accurately, but is also free from false object suppression and object fusion. The local constraint (orthogonality of neighboring objects) makes full use of the embedding space and gives a good geometric interpretation. Our method is especially attractive for images containing a large number of objects that are repeated and in contact, and it yields state-of-the-art results even with a lightweight backbone network.
Since our approach generates embeddings that live in orthogonal spaces, segmentations could be obtained directly from the embeddings if these spaces can be aligned with the standard basis by a rotation. An alternative approach to bypass postprocessing would be to add sparsity constraints on the embedding vector during training. We will test the feasibility of these two methods in the future.
- [1] Scharr, H., Minervini, M., French, A.P., Klukas, C., Kramer, D.M., Liu, X., Luengo, I., Pape, J., Polder, G., Vukadinovic, D., Yin, X., Tsaftaris, S.A.: Leaf segmentation in plant phenotyping: a collation study. Machine Vision and Applications, 27(4), 585–606 (2016)
- [2] Ulman, V., Maška, M., Magnusson, K.E.G., Ronneberger, O., Haubold, C., Harder, N., Matula, P., Matula, P., Svoboda, D., Radojevic, M., Smal, I., Rohr, K., Jaldén, J., Blau, H.M., Dzyubachyk, O., Lelieveldt, B., Xiao, P., Li, Y., Cho, S.Y., Dufour, A., Olivo-Marin, J.C., Reyes-Aldasoro, C.C., Solis-Lemus, J.A., Bensch, R., Brox, T., Stegmaier, J., Mikut, R., Wolf, S., Hamprecht, F.A., Esteves, T., Quelhas, P., Demirel, Ö., Malmström, L., Jug, F., Tomancák, P., Meijering, E., Muñoz-Barrutia, A., Kozubek, M., Ortiz-de-Solor, C.: An Objective Comparison of Cell-tracking Algorithms. Nature Methods, 14, 1141–1152 (2017)
- [3] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: 2015 MICCAI, 234–241
- [4] Chen, H., Qi, X., Yu, L., Dou, Q., Qin, J., Heng, P.A.: DCAN: Deep Contour-Aware Networks for Accurate Gland Segmentation. In: 2016 CVPR, 2487–2496
- [5] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In: 28th NIPS, 91–99
- [6] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C.Y., Berg, A.C.: SSD: Single Shot MultiBox Detector. In: 2016 ECCV, 21–37
- [7] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 ICCV, 2980–2988
- [8] Schmidt, U., Weigert, M., Broaddus, C., Myers, E.W.: Cell Detection with Star-Convex Polygons. In: 2018 MICCAI, 265–273
- [9] Jetley, S., Sapienza, M., Golodetz, S., Torr, P.H.: Straight to shapes: Real-time detection of encoded shapes. In: 2017 CVPR, 4207–4216
- [10] De Brabandere, B., Neven, D., Van Gool, L.: Semantic Instance Segmentation with a Discriminative Loss Function. In: 2017 CoRR
- [11] Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S., Murphy, K.P.: Semantic Instance Segmentation via Deep Metric Learning. In: 2017 CoRR
- [12] De Brabandere, B., Neven, D., Van Gool, L.: Semantic Instance Segmentation for Autonomous Driving. In: 2017 CVPR Workshop, 478–480
- [13] Kong, S., Fowlkes, C.C.: Recurrent Pixel Embedding for Instance Grouping. In: 2018 CVPR, 9018–9028
- [14] Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619 (2002)
- [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: 2016 CVPR, 770–778