1 Introduction
Instance segmentation aims to label each individual object, which is critical to many biological and medical applications, such as plant phenotyping and cell quantification. Learning object-aware pixel embeddings is one of the trends in the field of instance segmentation. The embedding is essentially a high-dimensional representation of each pixel. To achieve instance segmentation, pixel embeddings of the same object should be located relatively close to each other in the learned embedding space, while those of different objects should be discriminable.
The loss usually consists of two terms: the between-instance loss term and the within-instance loss term. The former encourages embeddings of different instances to lie far away from each other, while the latter encourages embeddings of the same instance to stay together. The two most popular metrics used to describe the similarity of embeddings are Euclidean distance and cosine distance. Although pixel embedding approaches have been successful on many datasets, including the CVPPP Leaf Segmentation Challenge [4, 5, 12, 16], the trained embedding space is far from optimal.
Our idea was indirectly inspired by the “easy task first” concept behind curriculum learning [1]. Distance regression predicts the distance from a pixel to the object boundary and is used in [4, 20], for example, as an auxiliary module. We have empirically found that the distance regression module is relatively easy to train on many datasets. Considering that the features learned by the distance regression module should already be useful for distinguishing instances, we prefix the embedding module with a distance regression module to promote the embedding learning process.
The main contributions of this paper are summarized as follows:

We propose an architecture that promotes pixel embedding learning by utilizing features learned by the distance regression module, which significantly improves the performance in the CVPPP Leaf Segmentation Challenge [19]. Our overall mean Symmetric Best Dice (mSBD) score of 0.879 was at the top of the leaderboard at the time of paper submission. Furthermore, the average of the mSBD scores on Arabidopsis images (testing sets A1, A2, A4) outperforms the second-best results from three different teams by over 3 percentage points, namely 0.917 versus 0.883;

We conduct a number of ablation experiments on the stacked U-Net architecture, different types of concatenative layers and varied loss formats, to validate our architecture and to fill some experimental gaps in this field.
2 Related Work
We roughly categorize instance segmentation approaches into two groups with respect to the overall pipeline: instance-first approaches and one-stage approaches. Instance-first approaches exploit the instance-level bounding boxes from a first-stage object detector. For example, Mask R-CNN [7] uses RPN [18], and recent methods like BlendMask [3] and CenterMask [10] are based on the anchor-free detector FCOS [21]. Pixel-level segmentations are then produced through subsequent refinement modules. Mask R-CNN [7] constructs a lightweight segmentation network with consecutive convolutional layers, while the Blender Module and the Spatial Attention-Guided Mask (SAG-Mask) are proposed in [3] and [10], respectively, for more accurate segmentation.
In contrast, one-stage approaches predict the existence (objectness) and the mask of objects all at once. Masks are represented in polar coordinates in [20, 25]. Specifically, the model regresses the distances to the boundary along a set of fixed directions at each location. To describe more complex shapes, masks are encoded with a linear projection in [27].
Furthermore, approaches based on pixel embedding learning, which also belong to the one-stage approaches, are becoming a new trend. They share the general pipeline of embedding and clustering. Each pixel of the input image is mapped to a high-dimensional vector (embedding), such that pixels of the same object are located close to each other. Clustering in the embedding space then yields the final instance segmentation. De Brabandere et al. and Neven et al. [5, 12] have proposed Euclidean-distance-based embedding losses for instance segmentation. Payer et al. [16] have demonstrated an embedding loss that utilizes cosine similarity together with a recurrent stacked hourglass network [13]. Chen et al. [4] have introduced a U-Net-based architecture with two heads, a distance regression head and an embedding head, where the embeddings are trained with a cosine embedding loss and local constraints. The distance regression head aims to provide seed candidates for clustering. Our proposed method inherits the fundamental modules from this work.
3 Method
Our network consists of two cascaded parts (Fig. 1): the distance regression module and the embedding module. Each module uses a U-Net architecture with a 32-dimensional output feature map as the backbone network. The learned distance and embedding feature maps are denoted as Dfeat. and Efeat., respectively.
The distance regression module takes standardized images (each image is linearly scaled to have mean 0 and variance 1) as inputs and outputs the distance map (abbreviated as distmap in the following) through a single convolutional layer with ReLU activation. The ground truth distmap is obtained by computing the shortest distance from each pixel to the object boundary and then normalizing instance-wise with respect to the maximal value. The distance regression module is trained with the Mean Squared Error (MSE) loss in this work, which is illustrated as Dloss in Fig. 1.
The distance feature map Dfeat. learned by the distance regression module is fed to the embedding module together with the input image by concatenation. Details of the concatenation are introduced in Section 3.2. The final embeddings are obtained through a convolutional layer with linear activation, followed by L2 normalization. The embedding module is trained with the loss based on the cosine similarity and local constraints (Section 3.1), denoted as Eloss in Fig. 1.
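The construction of the ground truth distmap described above can be sketched as follows. This is a minimal brute-force illustration rather than the implementation used in this work: the function name `ground_truth_distmap` is hypothetical, the boundary is assumed to be the closest pixel that does not belong to the same instance, and image borders are not treated as boundaries.

```python
import math


def ground_truth_distmap(labels):
    """Sketch of an instance-wise normalized distance map.

    labels: 2D list of ints, 0 = background, k > 0 = instance id.
    Assumes every instance has at least one pixel outside of it.
    """
    h, w = len(labels), len(labels[0])
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            if k == 0:
                continue
            # Euclidean distance to the closest pixel outside instance k
            # (assumed definition of "distance to the object boundary").
            best = math.inf
            for j in range(h):
                for i in range(w):
                    if labels[j][i] != k:
                        best = min(best, math.hypot(j - y, i - x))
            dist[y][x] = best

    # Instance-wise normalization with respect to the maximal value.
    max_per_instance = {}
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            if k:
                max_per_instance[k] = max(max_per_instance.get(k, 0.0), dist[y][x])
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            if k:
                dist[y][x] /= max_per_instance[k]
    return dist
```

The brute-force search is quadratic in the number of pixels and only meant for small examples; in practice a Euclidean distance transform would typically be used instead.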
The embedding space trained with the loss in Eq. 1 has an intuitive geometric interpretation: embedding vectors of neighboring objects tend to be orthogonal, which simplifies clustering. A fast angular clustering can be performed based on the angles between embedding vectors. First, seeds are obtained from the distmaps by taking local maxima above a threshold (selected as 70% of the global maximum in an image). After that, all neighboring pixels within a given angular range of a seed are collected to form a cluster. In this work, we use the same angular range for all experiments. Finally, the labels outside of the officially provided ground truth foreground masks are omitted.
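The angular clustering procedure above can be illustrated with the following sketch. It is an assumption-laden simplification: the function names are hypothetical, "neighboring pixels" is interpreted as 4-connected region growing from each seed, the default `max_angle_deg=30.0` is a placeholder and not the value used in this work, and the final foreground masking step is left out.

```python
import math
from collections import deque


def angle_deg(u, v):
    """Angle between two embedding vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 180.0  # treat zero vectors as maximally dissimilar
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def find_seeds(distmap, rel_threshold=0.7):
    """Local maxima of the distmap above rel_threshold * global maximum."""
    h, w = len(distmap), len(distmap[0])
    global_max = max(max(row) for row in distmap)
    seeds = []
    for y in range(h):
        for x in range(w):
            v = distmap[y][x]
            if v < rel_threshold * global_max:
                continue
            neighbors = [distmap[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))
                         if (j, i) != (y, x)]
            if all(v >= n for n in neighbors):
                seeds.append((y, x))
    return seeds


def angular_clustering(embeddings, distmap, max_angle_deg=30.0):
    """Grow one cluster per seed over 4-connected pixels whose embedding
    lies within max_angle_deg (hypothetical value) of the seed embedding."""
    h, w = len(distmap), len(distmap[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 1
    for sy, sx in find_seeds(distmap):
        if labels[sy][sx]:
            continue  # seed already absorbed by an earlier cluster
        seed_vec = embeddings[sy][sx]
        labels[sy][sx] = next_label
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 0
                        and angle_deg(embeddings[ny][nx], seed_vec) <= max_angle_deg):
                    labels[ny][nx] = next_label
                    queue.append((ny, nx))
        next_label += 1
    return labels
```

Because clusters are grown spatially from seeds, two non-adjacent objects that share the same embedding direction still receive different labels, which is what the local constraints rely on.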
3.1 Cosine Embedding Loss with Local Constraints
For training the embedding module, we build upon the loss format from [4]. The training loss, denoted as Eloss in Fig. 1, is defined based on the cosine similarity and is formulated as:
(1)  
where the embedding loss is defined as the weighted sum of the between-instance loss term and the within-instance loss term, balanced by a weighting factor. The loss is expressed in terms of the pixel embedding vectors and the mean embedding of each object, the number of objects, the number of pixels of a single object, the set of neighboring objects around an object, and the number of these neighbors.
The between-instance loss term encourages the embeddings of different objects to be separated, while the within-instance loss term penalizes pixel embeddings of the same object that diverge from the mean. In addition, the local constraints of this loss only force neighboring objects to form separable clusters in the embedding space. The benefits of local constraints and the comparison with the global constraint are demonstrated in Section 4.3.
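To make the structure of the two loss terms concrete, the following sketch computes one plausible form of such a loss on plain Python lists. The exact formula is an assumption: the within-instance term is taken as one minus the cosine similarity between each pixel embedding and its object mean, and the between-instance term as the absolute cosine similarity between the mean embeddings of neighboring objects, which is consistent with neighboring clusters being pushed toward orthogonality. The names `cosine_embedding_loss`, `neighbors` and `weight` are hypothetical.

```python
import math


def _cos(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_embedding_loss(embeddings, labels, neighbors, weight=1.0):
    """embeddings: list of per-pixel vectors; labels: list of object ids
    (0 = background, ignored); neighbors: dict object id -> set of ids of
    neighboring objects; weight: factor on the between-instance term."""
    # Group pixel embeddings by object and compute the mean embeddings.
    pixels = {}
    for e, k in zip(embeddings, labels):
        if k:
            pixels.setdefault(k, []).append(e)
    means = {k: [sum(c) / len(vs) for c in zip(*vs)] for k, vs in pixels.items()}

    # Within-instance term: pull pixel embeddings toward their object mean.
    within = sum(
        sum(1.0 - _cos(e, means[k]) for e in vs) / len(vs)
        for k, vs in pixels.items()
    ) / len(pixels)

    # Between-instance term: only neighboring objects are pushed apart.
    between = 0.0
    for k in pixels:
        nbrs = [n for n in neighbors.get(k, ()) if n in means]
        if nbrs:
            between += sum(abs(_cos(means[k], means[n])) for n in nbrs) / len(nbrs)
    between /= len(pixels)

    return weight * between + within
```

With perfectly tight and mutually orthogonal clusters this sketch returns zero, while two neighboring objects that share the same embedding direction contribute the full weight of the between-instance term.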
3.2 Feature Concatenative Layer
The feature map Dfeat. learned by the distance regression module is first transformed to the desired number of dimensions (shown with an example of 32 in Fig. 1) via a convolutional layer and then L2-normalized pixel-wise along the feature channels before being concatenated to the images. Our experiments show that normalizing the feature map is critical to a stable training process.
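The normalization and concatenation steps can be sketched as below. The learned convolutional layer that changes the number of channels is omitted, feature maps are represented as nested lists of shape height x width x channels, and both function names are hypothetical.

```python
import math


def l2_normalize_pixelwise(feat, eps=1e-12):
    """Scale the channel vector of every pixel to unit L2 norm."""
    return [
        [[c / max(math.sqrt(sum(v * v for v in px)), eps) for c in px] for px in row]
        for row in feat
    ]


def concat_channels(image, feat):
    """Concatenate image channels and feature channels pixel by pixel."""
    return [
        [list(ipx) + list(fpx) for ipx, fpx in zip(irow, frow)]
        for irow, frow in zip(image, feat)
    ]
```

After normalization every pixel contributes a feature vector of the same magnitude, so the forwarded features live on a fixed scale regardless of how large the raw activations of the first module become.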
As illustrated in Fig. 2, the difference between the leaf boundary and the leaf midvein (primary vein) is ambiguous. The embeddings learned by the U-Net architecture [4] often fail at those locations. However, the distmaps are able to tell the difference, with lower values representing leaf boundaries and higher values representing leaf midveins. From another perspective, the distmap, which gives an approximate outline of objects, can be interpreted as an objectness score, i.e. the pixel-wise probability of the existence of an object. In addition, as proposed by [14], mixing convolutional operations with the pixel location helps to construct dense pixel embeddings that can separate object instances. From this perspective, the distance regression features can indirectly provide location information to the subsequent module.
To this end, we construct a two-stage architecture, as depicted in Fig. 1, by forwarding the distance regression features to the embedding module. The concatenation of the distance regression features and the images yields the best performance in our experiments. We refer to the use of the distance features as the concatenative layer between the stacked U-Nets as intermediate distance regression supervision.
In the experiments, other features have also been tested as the forwarded input: the 1-dimensional distmap, 8-dimensional distance features, 32-dimensional distance features, 32-dimensional embedding features, and the concatenation of 16-dimensional distance features and 16-dimensional embedding features. Inspired by [12, 14], we have also evaluated the performance of augmenting the input image with x- and y-coordinates.
3.3 From U-Net to W-Net
We abbreviate the proposed network as W-Net to distinguish it from the existing U-Net with two heads, although this name does not fully convey its novel characteristics: the distance regression features as intermediate supervision and the cosine embedding loss with local constraints.
In Fig. 3, the detailed architectures of the U-Net with two heads and the W-Net with intermediate distance regression supervision are illustrated. The parallel distance and embedding heads of the U-Net (Fig. 2(a)) are replaced by the serial distance and embedding modules in the W-Net (Fig. 2(b)). Apart from the types of concatenative layer discussed previously, we have also investigated the final dimension of the embeddings as another hyperparameter, denoted as embedding_dim in Fig. 2(b). The corresponding ablation experiments can be found in Section 4.4.
4 Experiments
Ablation experiments are conducted with the U-Net and the W-Net, as depicted in Fig. 3. The training loss is the sum of the distance regression loss (ReLU+MSE) and the cosine embedding loss with local constraints (Eq. 1). The latest CodaLab dataset of CVPPP2017 LSC is used as the training set without augmentation. Model parameters are initialized by He Normal [8] and optimized by Adam [9]. The initial learning rate is set to 0.0001 and scheduled with exponential decay, with a decay period of 5000 steps and a decay rate of 0.9. The batch size is set to 4 in most experiments, or 2 if high embedding dimensions are used. The maximum number of training epochs is set to 500. We report the mSBD scores on the testing set from CodaLab as the evaluation metric.
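The learning rate schedule stated above corresponds to the following sketch. Whether the decay is applied continuously or in discrete steps is not specified in the text, so the `staircase` switch is an assumption, as is the function name.

```python
def learning_rate(step, initial_lr=1e-4, decay_steps=5000, decay_rate=0.9,
                  staircase=False):
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps).

    With staircase=True (an assumption, not confirmed by the text) the
    exponent is truncated to an integer, so the rate drops every
    decay_steps steps instead of continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent
```

Under these settings the learning rate is multiplied by 0.9 for every 5000 training steps.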
4.1 U-Net vs. W-Net
First, we illustrate the performance improvement from the U-Net with two heads to the proposed W-Net. In Fig. 4, two representative cases are shown in which the U-Net fails to separate closely located leaves. In contrast, the W-Net successfully distinguishes the numbered leaves in Fig. 4.
Quantitatively, the W-Net surpasses the U-Net on overall mSBD by approximately 8 percentage points, from 0.794 to 0.879, with the best setup for the W-Net, as shown in Table 2. Under different settings of embedding dimensions (Fig. 5(a)) and loss weights (Fig. 5(b)), the performance gap between the U-Net and the W-Net is observed consistently and remains at about 8 percentage points.
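For reference, the Symmetric Best Dice underlying the mSBD scores can be computed as in the sketch below, following the usual definition: the Best Dice from A to B averages, over the objects in A, the highest Dice score achieved with any object in B, and the symmetric score is the minimum over both directions. The helper names and the convention of returning 0 for an empty label map are assumptions; mSBD is the mean of this score over all images.

```python
def _dice(a, b):
    """Dice coefficient of two non-empty pixel sets."""
    return 2 * len(a & b) / (len(a) + len(b))


def _instances(label_map):
    """Collect the pixel set of every instance (0 = background)."""
    inst = {}
    for y, row in enumerate(label_map):
        for x, k in enumerate(row):
            if k:
                inst.setdefault(k, set()).add((y, x))
    return list(inst.values())


def best_dice(src, dst):
    """Average over src objects of the best Dice with any dst object."""
    if not src:
        return 0.0  # assumed convention for empty label maps
    return sum(max((_dice(a, b) for b in dst), default=0.0) for a in src) / len(src)


def symmetric_best_dice(pred, gt):
    p, g = _instances(pred), _instances(gt)
    return min(best_dice(p, g), best_dice(g, p))
```

Taking the minimum over both directions penalizes both merged and missed objects: a prediction that fuses two leaves scores poorly in either direction, and a prediction that misses a leaf is penalized in the ground-truth-to-prediction direction.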
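| Concatenative Layer | Net | mSBD |
|---|---|---|
| none (baseline) | U-Net | .794 |
| coordinate | U-Net | .798 |
| distmap | W-Net | .824 |
| dfeat.8 | W-Net | .864 |
| dfeat.32 | W-Net | .879 |
| efeat.32 | W-Net | .847 |
| dfeat.16+efeat.16 | W-Net | .873 |

| Local | Net | Clustering | mSBD |
|---|---|---|---|
| ✓ | W-Net | AC | .879 |
| ✓ | W-Net 64d | AC | .854 |
|  | W-Net | AC | .835 |
|  | W-Net 64d | AC | .823 |
| ✓ | U-Net | MWS | .719 |
| ✓ | W-Net | MWS | .771 |
| ✓ | U-Net | MeanShift | .679 |
| ✓ | W-Net | MeanShift | .733 |
| ✓ | U-Net | HDBSCAN | .631 |
| ✓ | W-Net | HDBSCAN | .681 |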
4.2 Concatenative Layer
We compare the effects of different types of concatenative layer. First, the distmap (1-dimensional) can be directly forwarded. Alternatively, the distance regression features can be utilized instead of the distmap. Before concatenation, we convert the 32-channel Dfeat. into 8 or 32 dimensions (denoted as dfeat.8 and dfeat.32 in Table 2) through a single convolutional layer.
Meanwhile, the case of using the embedding loss as the intermediate supervision (efeat.32) has also been tested. Specifically, the embedding features from the first U-Net are concatenated with the images as the inputs of the second embedding module. Furthermore, the concatenation of distance regression features and embedding features (dfeat.16+efeat.16) is also investigated. Finally, augmenting the input image with coordinates is tested. As argued in [14], dense object-aware pixel embeddings cannot be easily constructed using convolutions alone, and the situation can be improved by incorporating information about the pixel location. In this work, we augment the input image with two coordinate channels for the normalized x- and y-coordinates, respectively.
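The coordinate augmentation can be sketched as follows. The normalization range is not stated in the text; mapping the coordinates linearly to [0, 1] is an assumption here, and the function name is hypothetical.

```python
def add_coordinate_channels(image):
    """Append normalized x and y coordinates as two extra channels.

    image: nested list of shape height x width x channels.
    Coordinates are scaled to [0, 1] (assumed normalization).
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        ny = y / (h - 1) if h > 1 else 0.0
        row = []
        for x in range(w):
            nx = x / (w - 1) if w > 1 else 0.0
            row.append(list(image[y][x]) + [nx, ny])
        out.append(row)
    return out
```

The two added channels give every convolution direct access to the absolute pixel position, which plain convolutions cannot recover on their own because they are translation-equivariant.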
Experimental results are summarized in Table 2. First of all, forwarding distmaps is not as effective as forwarding feature maps, including the distance regression features and the embedding features. The embedding features (efeat.32) can also boost the performance, but not as significantly as the distance regression features: efeat.32 performs worse than both dfeat.32 and the mixed feature map dfeat.16+efeat.16. For the distance regression features themselves, the higher dimension of 32 is preferred. Finally, augmenting images with coordinates does not make an apparent difference in our experiments. Its effects could be studied further; for example, augmenting each intermediate feature map with coordinates is also worth investigating.
4.3 Local vs. Global Constraints
Local constraints make it possible to exploit a lower-dimensional embedding space more efficiently, since in this case different labels only have to be assigned to neighboring objects. In contrast, global constraints have to give every single object in the image a different label, which requires larger receptive fields and a more redundant embedding space. The combination of local constraints and cosine embeddings utilizes the embedding space even more thoroughly, as the push force imposed by the loss drives the embedding clusters of neighboring instances toward orthogonality.
This is confirmed qualitatively by the examples showcased in Fig. 5. In Fig. 4(c), 8-dimensional embeddings are trained with global constraints. Not surprisingly, there are exactly 8 colors in the image, indicating 8 orthogonal clusters in the embedding space. Evidently, the global constraint fails when the embedding dimension is smaller than the number of objects. In contrast, the local constraints (Fig. 4(a)-4(b)) can distribute labels alternately between objects, with the same label appearing multiple times for non-adjacent objects. This makes it possible to utilize a lower-dimensional embedding space. Quantitatively, the W-Net trained with local constraints surpasses the one trained with global constraints by more than 4 percentage points on overall mSBD, as listed in Table 2.
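The reuse of labels under local constraints is analogous to graph coloring, which the following sketch illustrates. It is only an analogy and not part of the training procedure: each "label" stands for one of the mutually orthogonal embedding directions, the neighborhood graph is given explicitly, and the greedy assignment and its function name are hypothetical.

```python
def greedy_label_assignment(neighbors):
    """Assign each object the smallest label not used by its neighbors.

    neighbors: dict object id -> set of ids of neighboring objects.
    Returns a dict object id -> label index.
    """
    assignment = {}
    for obj in sorted(neighbors):
        used = {assignment[n] for n in neighbors[obj] if n in assignment}
        label = 0
        while label in used:
            label += 1
        assignment[obj] = label
    return assignment


# A chain of five objects where only consecutive objects touch.
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
assignment = greedy_label_assignment(chain)
num_labels_local = len(set(assignment.values()))   # labels needed locally
num_labels_global = len(chain)                     # one label per object
```

For the chain above, two alternating labels are enough under local constraints, whereas a global constraint needs one label per object; since a d-dimensional space offers at most d mutually orthogonal directions, this difference determines how small the embedding dimension can be.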
Intuitively, a higher-dimensional embedding space provides more degrees of freedom, i.e. we could simply use higher-dimensional embeddings to alleviate the problem of global constraints. At least the embedding vector does not have to be restricted to low dimensions. However, from the results in Fig. 5(a), we find that higher-dimensional embeddings produce worse results. This makes the ability to use a lower-dimensional embedding space particularly important.
4.4 Dimensions of Embeddings
As discussed previously, the local constraints make the use of lower-dimensional embeddings possible. It is thus worth investigating the influence of different embedding dimensions on the overall performance. The mSBD scores of both the U-Net and the W-Net for {4, 8, 16, 32, 64}-dimensional embeddings are plotted in Fig. 5(a). For 32 and 64 dimensions, the batch size is set to 2, instead of 4 as in the other cases, to fit the memory of a single GPU.
Our experiments show that the 8-dimensional embedding yields the best result. First of all, merely 4 dimensions are insufficient to separate all adjacent objects, since it is common that one object has more than 4 neighbors. Although higher dimensions may not bring in more labels under local constraints (compare Fig. 4(a) to 4(b)), increasing the embedding dimension should, in principle, not degrade the performance. However, the mSBD score decreases slightly as the dimension increases. Therefore, we believe that, provided the dimensions are sufficient for all objects to fulfill the local constraints, a higher-dimensional embedding space is more difficult to train.
4.5 Loss Weights
During the experiments, we find that the values of the between-instance loss term are approximately 10 times greater than the values of the within-instance loss term. This is consistent with the observation that pixel embeddings of the same object converge tightly, while adjacent objects are occasionally not segmented correctly. A larger weighting factor for the between-instance loss term might help to emphasize it by amplifying its gradient. We set the weighting factor to {0.5, 1, 10, 100, 500} and, moreover, also omit the within-instance loss entirely, which corresponds to the between-instance-only setting in Fig. 5(b). The experiments are performed for both the U-Net and the W-Net under identical main setups: 32-dimensional distance features as the concatenative layer, local constraints and 8-dimensional embeddings.
From the experiments, we find that a larger weighting factor for the between-instance loss term does not further help the network to separate the confused objects once the weighting factor is larger than 1, but instead reduces the consistency of the embeddings within the same object. Fig. 7 showcases the tradeoff between the discrimination of adjacent objects (larger weighting factor) and the consistency of individual objects (smaller weighting factor). The weighting factor that yields the best overall performance can be read from Fig. 5(b). Besides, one surprising finding is that training the network with just the between-instance loss term can also, to some extent, form clusters in the embedding space (Fig. 6(d)).
4.6 Clustering
Apart from the default angular clustering used throughout the experiments, three other clustering techniques have been tested on the predicted embeddings of the best results: Mutex Watershed [24], Mean Shift [6] and HDBSCAN [2]. On the one hand, this provides a reference for the performance of different clustering methods on embeddings trained with a cosine-similarity-based loss. On the other hand, it also indirectly reflects the quality of the embeddings generated by the U-Net and the W-Net. Results are shown in Table 2.
In conclusion, the angular clustering has advantages in terms of performance and speed. Nevertheless, it should be noted that this method is only applicable when seeds are available and clusters are orthogonal in the embedding space. Additionally, all clustering approaches produce better results with the embeddings predicted by the W-Net, which again confirms the improvement achieved by our proposed method.
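| Method | Backbone | Train | Aug., Emb., Fg., Lb. (in order) | mSBD A1 | mSBD A1-3 | mSBD A1-5 |
|---|---|---|---|---|---|---|
| IPK [15, 19] |  | A1-3 | ✓ | .791 | .782 |  |
| Nottingham [19] |  | A1-3 | ✓ | .710 | .686 |  |
| MSU [19, 26] |  | A1-3 | ✓ | .785 | .780 |  |
| Wageningen [19] |  | A1-3 | ✓ | .773 | .769 |  |
| MRCNN [4, 7] | ResNet | A1-3 |  |  | .797 |  |
| Stardist [4, 20] | U-Net | A1-3 |  |  | .802 |  |
| ISRA [17] | FCN | A1 |  | .849 |  |  |
| Ward [22] | ResNet | A1-4+syn | ✓ | .900 | .740 | .810 |
| UPGen [23] | ResNet | A1-4+syn | ✓ | .890 | .877 | .874 |
| DiscLoss [5] | ResNet | A1 | ✓, euc, ✓ | .842 |  |  |
| CERH [16] | HG | A1 | ✓, cos | .845 |  |  |
| ELC [4] | U-Net | A1-3 | cos |  | .831 | .823 |
| W-Net (ours) | U-Net | A1-4 | cos, ✓, ✓ | .919 | .870 | .879 |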
4.7 Comparison against State-of-the-Art
A quantitative comparison of state-of-the-art methods on the CVPPP LSC dataset is shown in Table 3. The learning-based methods (denoted with backbones) clearly achieve better results than the first four classical methods. The last four methods are based on pixel embedding learning and, roughly speaking, deliver promising results. Our overall mSBD result for A1-5 outperforms all others. On the leaderboard, our overall result was in first position at the time of paper submission. Furthermore, the average of the mSBD scores for Arabidopsis images (A1, A2, A4) outperforms the second-best results from three different users by over 3 percentage points, namely 0.917 versus 0.883. Due to the extremely imbalanced numbers of training images for Arabidopsis (783 images) and Tobacco (27 images), our result on testing set A3 is not as good as others, with an mSBD of 0.77. By comparison, the current first-place mSBD for A3 on the leaderboard reaches 0.89. This implies that a sufficient number of training images is critical for our proposed method. We leave this room for improvement to future work. One thing worth mentioning is that many authors tend not to submit their results to the CodaLab leaderboard, which makes a consistent comparison and review rather difficult.
4.8 Application to Human U2OS Cells
Our method has also been tested on the image set BBBC006v1 of human U2OS cells from the Broad Bioimage Benchmark Collection [11]. In total, 754 images are randomly split into a training set and a testing set of 377 images each. Other setups are identical to those introduced previously. We use the U-Net and the W-Net with the distance concatenative layer and report results in mSBD and mean Average Precision with IoU={0.5, 0.55, 0.6, …, 0.9} (mAP). The mSBD increases from 0.896 to 0.915 and the mAP from 0.577 to 0.664. We showcase two examples of final labels in Fig. 8. As reported in [4], some embeddings around boundaries might be incomplete, which leads to incomplete segmentations. This problem has been largely solved, as showcased in Fig. 8.
5 Conclusion
In this work, we propose a novel W-Net, which forwards the distance regression features learned by the first-stage U-Net to the subsequent embedding learning module. The intermediate distance regression supervision effectively improves the quality of the learned pixel embedding space, with the mSBD score on the CVPPP LSC dataset increased by more than 8 percentage points compared to the identical setup without supervision of distance regression features. We have also conducted a number of experiments to investigate the characteristics of pixel embedding learning with the cosine-similarity-based loss, involving the embedding dimensions and the weighting factor between the within-instance loss term and the between-instance loss term. We are looking forward to applying this method to more datasets in the future.
5.0.1 Acknowledgments
This work was supported by the German Research Foundation (DFG) Research Training Group 2416 MultiSensesMultiScales.
