We present a new instance segmentation approach tailored to biological images, where instances may correspond to individual cells, organisms or plant parts. Unlike in instance segmentation of user photographs or road scenes, in biological data the object instances may be densely packed, the appearance variation across instances may be very low, and the available processing power may be restricted; on the other hand, the variability of instance sizes is often limited. The proposed approach addresses and exploits these peculiarities. It describes each object instance via the expectations of a small number of sine waves with frequencies and phases adjusted to the particular object sizes and densities. At train time, a fully-convolutional network is learned to predict these embeddings at each pixel using a simple pixelwise regression loss, while at test time the instances are recovered using clustering in the embedding space. In the experiments, we show that our approach outperforms previous embedding-based instance segmentation approaches on a number of biological datasets, achieving state-of-the-art on the popular CVPPP benchmark. Notably, this performance is combined with the computational efficiency needed for deployment to domain specialists. The source code is publicly available at GitHub: https://github.com/kulikovv/harmonic
Instance segmentation (object separation) in biological images is often a key step in their analysis. Many biological image modalities (e.g. microscopy images of cell cultures) are characterized by excessive numbers of instances, while others (e.g. worm assays) are characterized by tight and complex overlaps and occlusions. On the other hand, in most situations the scale variations of the objects of interest in biomedical data are less drastic than in natural photographs, due to the lack of strong perspective effects. In this work, we propose a new instance segmentation approach designed for biological images that addresses the challenges (number of instances, overlaps) while exploiting the simplifying properties (limited scale variation).
Our approach continues the line of works [7, 21] that perform instance segmentation by learning deep embeddings, and using clustering in the embedding space to recover the instances at test time. Learning good embeddings for object instances with a fully-convolutional network is however challenging, especially for biological data, where individual instances may have almost indistinguishable appearance.
To utilize the specific nature of biomedical images, we depart from the end-to-end philosophy of the previous embedding-based instance segmentation works, and split the learning process into two stages. At first stage, we seek a small set of harmonic functions that can be used to separate objects in the training dataset. The search is implemented as an optimization process that tunes the frequencies and phases of the harmonics to a specific range of scales and object densities in the data. The selected set of harmonics then guides the second stage of the learning process as well as the inference process at test-time.
At the second learning stage, we assign each ground truth object instance its harmonic embedding based on the expectation of the learned set of functions. We then learn a deep fully-convolutional network to predict resulting embeddings at each pixel. We show that learning with a simple pixelwise regression loss is feasible, as long as the information about harmonic functions is provided to convolutional layers in the network (which we achieve using a special new kind of a convolutional layer). The learned networks generalize well to new images, and tend to predict pixel embeddings that can be easily clustered into object instances.
In the experiments, we compare our approach to direct embedding-based instance segmentation as well as to other state-of-the-art methods. Four biomedical datasets corresponding to plant phenotyping, bacterial and human cell culture microscopy, and C. elegans assays are considered. We observe considerable performance improvements brought by our approach.
Proposal-based methods combine object detection with object mask estimation, and currently represent the state-of-the-art on non-biological instance segmentation benchmarks. The necessity to perform object detection followed by non-maximum suppression makes learning and tuning of methods from this group complex, especially in the presence of tight object overlaps, when non-maximum suppression becomes fragile.
Another group of approaches to instance segmentation is based on recurrent neural networks (RNNs) that generate instances sequentially. Thus, Romera et al.  train a network for end-to-end instance segmentation and counting using LSTM networks , while Ren et al.  propose a combination of a recurrent network with bounding box proposals. RNN-based frameworks show excellent performance on small datasets and achieve state-of-the-art results on the CVPPP plant phenotyping dataset. The major problem of recurrent methods is the vanishing gradient effect, which becomes particularly acute when the number of instances is large.
Our method falls into the category of proposal-free approaches to instance segmentation based on instance embeddings. Here, a neural network is used to embed the pixels of an image into a hidden multidimensional space, where embeddings of pixels belonging to the same instance should be close, while embeddings of pixels from different objects should be well separated. A clustering algorithm can then be applied to separate the instances. To achieve this, the approach  penalizes pairs of pixels using a logistic distance function in the embedding space. The embedding is learned using a log-loss and requires weighting pairs of pixels in order to mitigate the size imbalance issue. This method also predicts a seediness score for each pixel that correlates with its centeredness, and uses this score to pick objects from the embedding. Kong et al.  use a differentiable Gaussian blurring mean-shift for recurrent grouping of embeddings. Deep Coloring  proposes a reduction of instance segmentation to semantic segmentation, where class labels are reused for non-adjacent objects; the instances are then retrieved using connected component analysis.
Most related to ours, De Brabandere et al.  use a non-pairwise discriminative loss function composed of two parts: one pushing the embedding centers of different objects further apart, the other pulling the embeddings of pixels of the same object closer to their mean. Instances are retrieved using the mean-shift algorithm. The approach  uses metric learning together with an explicit assignment of the center of mass as the target embedding. Our approach follows the general paradigm of [7, 21], but suggests a special kind of embedding detailed below. The use of the new embeddings results in an explicit assignment of an embedding to each pixel of a training image, thus simplifying the learning process.
We now discuss our approach in detail. Existing instance embedding methods [7, 16, 9] do not prespecify target embeddings for pixels in the training set. Instead, they rely on the learning process itself to define these embeddings. In contrast, our goal is to define “good” embeddings for pixels a priori. “Goodness” here means amenability for clustering as well as learnability by a convolutional architecture.
Let $\{ f_i(p;\, \theta_i) \}_{i=1}^{N}$ (1) be a family of real-valued functions in the image domain, where $p = (x, y)$ corresponds to the coordinates of the argument, and $\theta_i$ is a set of learnable parameters defining the shape of the $i$-th function (e.g. the frequency vector and the phase of a sine function). As our approach is built in many ways around this family of functions, we call them the guide functions.

Let $S$ be an arbitrary set of pixels (e.g. an object instance in the ground truth annotation of a training image). We denote with $\mathbb{E}_S f_i$ the expectation of $f_i$ over $S$:

$$\mathbb{E}_S f_i = \frac{1}{|S|} \sum_{p \in S} f_i(p;\, \theta_i) \qquad (2)$$

If $\Theta = (\theta_1, \dots, \theta_N)$ denotes the joint vector of parameters of all functions, then the guided embedding of an object $S$ determined by $\Theta$ is defined as the following $N$-dimensional vector:

$$e(S;\, \Theta) = \bigl[\, \mathbb{E}_S f_1,\; \mathbb{E}_S f_2,\; \dots,\; \mathbb{E}_S f_N \,\bigr]$$

To sum up, the guided embedding maps each object to the expectations of the guide functions over this object.
Given a new dataset representing a new type of instance segmentation problem, our goal is to find a good set of guide functions (1), so that different objects have well-separated guided embeddings.
To do that, we first restrict $f_i$ to a certain functional family parameterized by $\theta_i$. As discussed above, in many biomedical datasets there is a certain (imperfect) regularity in the location of objects; e.g. monolayer cell cultures organize themselves into a texture composed of elements of approximately the same size that are adjacent to each other. Such loosely-regular, semi-periodic structure calls for the use of harmonic functions as guides:

$$f_i(x, y;\, \theta_i) = \sin\!\Bigl( 2\pi \bigl( u_i \tfrac{x}{W} + v_i \tfrac{y}{H} \bigr) + \phi_i \Bigr) \qquad (3)$$

where $W$ and $H$ are the image width and height respectively, $u_i$ and $v_i$ are frequency parameters, and $\phi_i$ is a phase parameter.
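As an illustration, the harmonic guide functions and the guided embedding of a pixel set can be sketched in a few lines of NumPy (a simplified sketch; the function names and parameter layout are ours, not taken from the released code):

```python
import numpy as np

def harmonic_guide(x, y, u, v, phase, width, height):
    """One harmonic guide function: sin(2*pi*(u*x/W + v*y/H) + phase)."""
    return np.sin(2 * np.pi * (u * x / width + v * y / height) + phase)

def guided_embedding(pixels, params, width, height):
    """Guided embedding of a pixel set: expectation of every guide over the set.

    pixels: array of shape (P, 2) with (x, y) coordinates;
    params: list of (u, v, phase) triples, one per guide function.
    """
    pixels = np.asarray(pixels, dtype=float)
    x, y = pixels[:, 0], pixels[:, 1]
    return np.array([harmonic_guide(x, y, u, v, ph, width, height).mean()
                     for (u, v, ph) in params])
```

Since every guide is a sine, each embedding coordinate lies in [-1, 1], so the $N$ guide functions span a bounded embedding space.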
Assume now that a set of training images is given. We can then estimate the quality of the guided embeddings by looking at pairs $S_i$ and $S_j$ of objects belonging to the same image (e.g. two different cells from the same image) and finding out how frequently they have very close embeddings. Ideally, we want to avoid such collisions in the embedding space (at the very least, we want to avoid them on the training set). The following loss is therefore considered:

$$\mathcal{L}(\Theta) = \sum_{I} \sum_{(S_i, S_j) \in P(I)} \max\Bigl( 0,\; m - \bigl\| e(S_i;\, \Theta) - e(S_j;\, \Theta) \bigr\| \Bigr) \qquad (4)$$

where $\|\cdot\|$ is the distance in the embedding space, $m$ is the margin meta-parameter, and $P(I)$ denotes the set of all pairs of objects from the training image $I$. Each individual term in (4) is a hinge loss term that is non-zero if the guided embeddings of a certain object pair are too close (closer than $m$).
To find good guide functions, we minimize the loss (4) on the training set. We perform stochastic gradient descent by drawing minibatches of random pairs of objects from random images and updating $\Theta$ to minimize (4) for the pairs from the minibatch. In our implementation, we initialize the frequency parameters $u_i$ and $v_i$ to uniformly distributed random numbers from a fixed interval, while the phase parameters $\phi_i$ are initialized uniformly over $[0, 2\pi)$.
The outcome of the learning is a set of guide functions such that pairs of objects from the training images have their guided embeddings separated in the embedding space. For typical settings of $N$ and $m$, most pairs in the training set end up separated by more than the margin.
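The selection objective (4) itself is simple to express. The sketch below (our own minimal version, with a Euclidean distance assumed) computes the hinge loss for the guided embeddings of all object pairs of one image:

```python
import numpy as np
from itertools import combinations

def pairwise_hinge_loss(embeddings, margin=0.5):
    """Loss (4) for one image: hinge on the embedding distance of every object pair.

    embeddings: list of guided-embedding vectors, one per object instance.
    The loss reaches zero exactly when all pairs are separated by more than `margin`.
    """
    loss = 0.0
    for e_i, e_j in combinations(np.asarray(embeddings, dtype=float), 2):
        loss += max(0.0, margin - np.linalg.norm(e_i - e_j))
    return loss
```

During guide selection this quantity is minimized over the frequencies and phases, which in an autodiff framework such as PyTorch amounts to a few lines of standard SGD code.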
Assume now that the parameters $\Theta$ of the guide functions have been optimized on the training set and are now fixed. To derive the loss for the second stage of the training process, we further denote by $S(p)$ the mapping from a pixel $p$ to the object containing this pixel.
We then train a deep fully-convolutional embedding network $g_\psi$ with parameters $\psi$ to map input images to $N$-channel images, so that each pixel is assigned an $N$-dimensional embedding. During learning, we minimize the following simple loss function:

$$\mathcal{L}(\psi) = \sum_{p \in \Omega(I)} \bigl\| g_\psi(I)[p] - e\bigl( S(p);\, \Theta \bigr) \bigr\| \qquad (5)$$

Here, $\Omega(I)$ denotes the set of foreground pixels of the image $I$ and $g_\psi(I)[p]$ denotes the output of the network at the spatial position $p$ (if the foreground/background segmentation is not available, then the summation is taken over the full image). By minimizing (5), we encourage the network to map each pixel to the guided embedding of the object it belongs to.
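Given the fixed guide parameters, the training targets for loss (5) are fully determined by the ground truth instance map. A simplified sketch (our own naming; a Euclidean per-pixel distance is assumed):

```python
import numpy as np

def regression_targets(instance_map, object_embeddings, n_dims):
    """Per-pixel targets for loss (5): each foreground pixel receives the guided
    embedding of the instance it belongs to; background pixels stay at zero.

    instance_map: (H, W) integer array, 0 for background, k > 0 for instance k.
    object_embeddings: dict mapping instance id -> N-dimensional embedding.
    """
    h, w = instance_map.shape
    targets = np.zeros((n_dims, h, w))
    for obj_id, emb in object_embeddings.items():
        targets[:, instance_map == obj_id] = np.asarray(emb, dtype=float)[:, None]
    return targets

def pixelwise_loss(pred, targets, instance_map):
    """Mean per-pixel embedding distance over the foreground pixels."""
    dists = np.linalg.norm(pred - targets, axis=0)
    return dists[instance_map > 0].mean()
```

A network predicting exactly the guided embedding of each pixel's instance attains zero loss, which is what makes the regression formulation so simple.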
We have found that standard fully-convolutional architectures (e.g. U-Net ) perform very well and achieve low train and test losses (5), provided one important modification to the convolutional layers is made. When modifying a convolutional layer, we augment its input with an extra set of maps holding the guide function values. Specifically, the extra maps contain the values $f_i(sx, sy;\, \theta_i)$ at each spatial position $(x, y)$. Here, $s$ is the downsampling factor of the layer (compared to the input/output resolution). The use of the downsampling factor is needed to make sure that the augmenting maps in different layers are spatially aligned with the output.
Note that our augmentation idea generalizes the recently suggested CoordConv layer  that augmented the input of convolutional layers with the pixel coordinates $x$ and $y$. By analogy, and since the guide functions in our implementation are harmonic, we call the new operation the SinConv layer (Figure 2).
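A SinConv layer can thus be seen as an ordinary convolution whose input is first concatenated with the guide maps, evaluated at coordinates rescaled by the layer's downsampling factor. A minimal sketch of this concatenation step (our own code, not the released implementation):

```python
import numpy as np

def sinconv_augment(features, params, out_width, out_height, stride):
    """Concatenate guide-function maps to a (C, H, W) feature tensor.

    Coordinates are multiplied by `stride` (the layer's downsampling factor),
    so that guide maps at every depth of the network stay spatially aligned
    with the full-resolution output.
    """
    c, h, w = features.shape
    ys, xs = np.mgrid[0:h, 0:w]
    guides = [np.sin(2 * np.pi * (u * (stride * xs) / out_width
                                  + v * (stride * ys) / out_height) + ph)
              for (u, v, ph) in params]
    return np.concatenate([features, np.stack(guides)], axis=0)
```

The convolution that follows then sees $C + N$ input channels instead of $C$.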
At test time, the application of the learned embedding network is straightforward: the network is applied to the input image, and post-processing similar to  follows. We use the mean-shift clustering algorithm  to obtain instance masks from the embedding space (Figure 3, bottom row). The mean-shift bandwidth is set to the margin $m$ used during guide function selection, since both parameters have the meaning of the desirable separation between the embeddings of different instances.
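The paper uses mean-shift (e.g. the scikit-learn implementation) for this step; as a self-contained illustration of why a single bandwidth suffices, here is a much simpler greedy stand-in (our own toy code, not the paper's method) that opens a new instance whenever a pixel embedding is farther than the bandwidth from every existing cluster center:

```python
import numpy as np

def greedy_cluster(embeddings, bandwidth):
    """Toy replacement for mean-shift: one pass over the pixel embeddings.

    embeddings: (P, N) array of per-pixel embedding vectors.
    Returns an integer label per pixel and the list of cluster centers.
    """
    centers, labels = [], []
    for e in np.asarray(embeddings, dtype=float):
        dists = [np.linalg.norm(e - c) for c in centers]
        if dists and min(dists) < bandwidth:
            labels.append(int(np.argmin(dists)))
        else:
            centers.append(e)
            labels.append(len(centers) - 1)
    return np.array(labels), centers
```

Because guide selection drives different instances at least the margin apart while the regression loss keeps same-instance embeddings tight, any clustering with radius equal to the margin separates the instances.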
We provide results of our method on three challenging bright-field microscopy datasets (C. elegans, E. coli, and HeLa) and on the plant phenotyping dataset (CVPPP 2017, sequence A1). In each case, learning was done on a single NVidia Tesla V100 GPU. Training was performed in all cases using the ADAM optimizer with learning rate 1e-5. All code was implemented using the PyTorch framework .
The architecture and data augmentation were the same for all datasets. In our experiments we used the U-Net  neural network and replaced the first convolution of each upscaling block with the SinConv layer. The network was trained from scratch. Due to the small number of training images in these datasets, we added several data augmentation procedures, namely patch cropping, scaling, and left-right flips. The number of embedding dimensions $N$ was set to 12 (with that dimensionality and $m = 0.5$ we obtained zero hinge loss (4) during guide function selection), and the mean-shift bandwidth was set to the same value. Note that the availability of parameters that work well for diverse datasets is very important for practitioners. The number of training epochs was set differently for different datasets due to their varying complexity.
We used the Symmetric Best Dice coefficient (SBD) and average precision (AP) as metrics. The SBD metric matches each label in one set with the label in the other set yielding the maximum overlap, averages the resulting Dice scores, and takes the minimum over the two matching directions. The AP metric integrates precision over different recall values.
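For reference, SBD can be computed as follows (a minimal sketch assuming binary instance masks; names are ours):

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total else 0.0

def symmetric_best_dice(pred_masks, gt_masks):
    """SBD: best-match Dice averaged in each direction, minimum of the two."""
    def best_dice(src, dst):
        return float(np.mean([max(dice(s, d) for d in dst) for s in src]))
    return min(best_dice(pred_masks, gt_masks), best_dice(gt_masks, pred_masks))
```

Taking the minimum of the two directions penalizes both over-segmentation and under-segmentation.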
We used De Brabandere et al.  as the main baseline and reimplemented their approach using the same network architecture as ours (both variants, with and without SinConv layers, were tried). On the CVPPP dataset, where the result of the authors' implementation is known, our re-implementation performs considerably better, suggesting that it forms a strong baseline.
The Computer Vision Problems in Plant Phenotyping (CVPPP) dataset (Figure 4) is one of the most popular instance segmentation benchmarks. The dataset consists of five sequences of different plants. We used the most common sequence A1, which has the most significant number of baselines. The A1 sequence has 128 top-down view images as a training set and an additional hidden test set of 33 images from the same sequence. The task of instance segmentation here is challenging because of the high variety of leaf shapes and complex occlusions between leaves. The performance of competing algorithms is measured using the SBD metric and the absolute difference in counting, |DiC| (c.f. ).
To fit the embedding space, we trained the neural network for 500 epochs. Table 1 shows that our method is currently state-of-the-art compared to all published methods.
| Method | \|DiC\| | SBD |
|---|---|---|
| Recurrent IS  | 1.1 | 56.8 |
| Recurrent IS + CRF  | 1.1 | 66.6 |
| Recurrent with attention  | 0.8 | 84.9 |
| Discriminative loss  | 1.0 | 84.2 |
| Deep coloring  | 2.0 | 80.4 |
| Discriminative loss  (our implementation) | | |
| Ours without SinConv | 5. | 78.3 |
The E. coli dataset (Figure 5) is interesting because the number of organisms is large and they are crowded. The dataset contains 37 bright-field images. The ground truth is derived from weak annotations, in which every organism is annotated by a line segment, using the watershed algorithm .
At test time, images were processed in non-overlapping crops. The SBD score was calculated for each crop independently and then averaged. The performance of our method is better than that of other methods previously evaluated on this dataset (Table 2). Unfortunately, we were not able to get reasonable results from the method , probably due to the drastic change of the organism number between different crops.
The HeLa cancer cell dataset (courtesy of Dr. Gert van Cappellen, Erasmus Medical Center, Rotterdam; Figure 6) is quite different from the other three datasets. Cells take up a large part of each image and, being cancerous, are more irregular and form intricate patterns. In contrast to the small and crowded pictures with E. coli, the number of cells is moderate, but they have a large area and more diverse sizes. The dataset contains 18 partially annotated single-channel training images. Following best practices, we split the dataset into train and test parts (9 images each). The goal of this experiment is thus to show that our method can generalize well given very few training examples.
We trained the network for 8000 epochs. No information about the background was used for this dataset. On this dataset, we report the SBD both without and with the foreground mask; the result marginally outperforms the semantic segmentation IOU baseline reported in . Our implementation of the baseline method  did not produce reasonable results in this configuration.
Finally, we look at the C. elegans dataset (Figure 7), available from the Broad Bioimage Benchmark Collection . The sequence contains 97 two-channel images of 696 x 520 pixels, each showing roundworms C. elegans. Each image contains approximately 30 organisms, some of them in complex overlapping patterns. In order to compare with the results from , we follow their protocol: the whole dataset was split into two roughly equal parts, a 50-image training set and 47 test images. Here, we use the binary segmentation masks (following [21, 28, 30]). The network was trained for 1000 epochs.
| Semi-convolutional operators  | 0.569 | 0.885 | 0.661 | 0.511 | 0.671 |
| Mask RCNN  | 0.559 | 0.865 | 0.641 | 0.502 | 0.650 |
| Discriminative loss  (our implementation without SinConv) | 0.343 | 0.624 | 0.380 | 0.441 | 0.563 |
| Discriminative loss  (our implementation with SinConv) | 0.478 | 0.771 | 0.560 | 0.551 | 0.677 |
Despite achieving state-of-the-art results on biological datasets, the proposed method has several limitations that need to be resolved before applying it to more complex datasets with severe variation in object scale, such as COCO , PASCAL VOC , and Cityscapes  (where our initial attempts to apply the method led to mediocre, i.e. mid-table, results). To the best of our understanding, the sub-par performance is caused by the inability to handle very diverse scales gracefully. We are currently investigating multi-scale schemes as well as other families of guide functions, which may potentially improve the results.
We have presented a new instance segmentation approach that exploits the peculiarities of biological images. The approach is based around a new type of embedding built upon sine waves whose parameters are adjusted, prior to the main training stage, to achieve separation of the ground truth instances in the training dataset. We have shown that such precomputation of good embeddings greatly simplifies the learning stage, provided the same guide patterns are fed into some of the convolutional layers of the embedding network. The ease of training is evidenced by the superior performance of our method compared to .
In the experiments, we have shown the ability of our method to handle rather diverse biological image data while using the same relatively small architecture and the same set of meta-parameters. Such versatility is valuable for the practical deployment of the method to domain specialists.
The source code is publicly available at Github: https://github.com/kulikovv/harmonic
This work was supported by the Skoltech NGP Program (MIT-Skoltech 1911/R). Assistance of Skoltech HPC team is deeply appreciated. High-performance computations presented in the paper were carried out on Skoltech HPC cluster Zhores.
PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration. http://pytorch.org. Accessed: 2018-11-15.
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.