1 Introduction
Live cell microscopy imaging is a key component of the biological research process. However, without the proper analysis tools, the raw images are a diamond in the rough: one must obtain a segmentation of the raw images delineating the individual cells before the cells' properties can be calculated. Manual segmentation is infeasible due to the large number of images and of cells per image.
Automatic segmentation tools are available and can be roughly split into two groups, unsupervised and supervised methods. The unsupervised methods vary and include automatic gray-level thresholding [1], the watershed algorithm [2] and active contours [3, 4]. Another approach supports the segmentation algorithm with temporal information from tracking algorithms, as proposed in [5, 6]. All these methods assume some structure in the data that may not fit every case.
Supervised methods, on the other hand, do not assume any structure but rather aim to learn it from the data. Classic machine learning methods generally require two independent steps: feature extraction and classification. In most cases the feature extraction is based either on prior knowledge of the image properties, as in [7], or on general image properties such as smoothing filters, edge filters, etc. A widely used toolbox that takes a pixel-classification approach is Ilastik [8], which uses a random forest classifier trained on predefined features extracted from a user's scribbles on the image.
Recent developments in the computer vision community have shown the strength of Convolutional Neural Networks (CNNs), which surpass state-of-the-art methods in object classification [9], semantic segmentation [10] and many other tasks. Recent attempts at cell segmentation using CNNs include [11, 12]. The common ground of all CNN methods is the need for an extensive training set alongside a predefined loss function such as the cross-entropy (CE).

In this work we present a novel approach for microscopy cell segmentation inspired by the GAN [13] and extensions thereof [14, 15, 16, 17]. The GAN framework is based on two networks, a generator and a discriminator, trained simultaneously with opposing objectives. This allows the discriminator to act as an abstract loss function, in contrast to common fixed losses such as the CE. We propose a pair of adversarial networks, an estimator and a discriminator, for the task of microscopy cell segmentation. Unlike the original GAN [13], we do not generate images from random noise vectors but rather estimate the underlying variables of an image. The estimator learns to output a segmentation of the image while the discriminator learns to distinguish between expert manual segmentations and estimated segmentations given the associated image. The discriminator is trained to minimize a classification loss on two classes, manual and estimated, i.e., minimizing the similarity between the two. The estimator, on the other hand, is trained to maximize the discriminator's loss and, effectively, to maximize the similarity. In [18], semantic segmentations of natural images are generated for a set of predefined classes. The main difference lies in our need to separate instances of a single class (cells) rather than to separate different classes. The method also differs in the choice of discriminator architecture and training method.

Our contribution is threefold. We expand the concept of the GAN to the task of cell segmentation and thereby reduce the dependency on the selection of a loss function. We propose a novel architecture for the discriminator, referred to as the "Rib Cage" architecture (see Section 2.4.2), which is adapted to the problem. The "Rib Cage" architecture includes several cross connections between the image and the segmentation, allowing the network to model complicated correlations between the two. Furthermore, we show that accurate segmentations can be achieved with a low number of training examples, therefore dramatically reducing the manual workload.
The rest of the paper is organized as follows. Section 2 defines the problem and elaborates on the proposed solution. Section 3 presents the results of the proposed adversarial training compared to a common non-adversarial loss, showing promising initial results. Section 4 summarizes and concludes the work thus far.
2 Methods
2.1 Problem Formulation
Let $\Omega \subset \mathbb{R}^2$ define the image domain and let the image $x:\Omega \rightarrow \mathbb{R}$ be an example generated by the random variable $X$. Our objective is to partition the image into individual cells, where the main difficulty is separating adjacent cells. Let the segmentation image $s$ be a partitioning of $\Omega$ into three disjoint sets, background, foreground (cell nuclei) and cell contour, also generated by some random variable $S$. The two random variables are statistically dependent with some unknown joint probability $p(X, S)$. The problem we address can be formulated as finding the most likely partitioning from the data given only a small number, $N$, of example pairs $\{x_n, s_n\}_{n=1}^{N}$. Had $p(X, S)$ been known, the optimal estimator would be the Maximum Likelihood (ML) estimator:

$$\hat{s}_{\mathrm{ML}}(x) = \arg\max_{s} \; p\left(x \mid s\right) \qquad (1)$$

However, since $p(X, S)$ is unknown and $\hat{s}_{\mathrm{ML}}$ cannot be calculated, we learn a near-optimal estimator of $s$ using the manual segmentation, $s$, as our target.
2.2 Estimation Network
We propose an estimator in the form of a CNN with parameters $\Theta_E$. We wish to train the estimator such that the estimated $\hat{s} = E(x; \Theta_E)$ will be as close as possible to the optimal ML estimation $\hat{s}_{\mathrm{ML}}$. This is achieved by optimizing for some loss function $\ell$ (defined in Section 2.3):

$$\hat{\Theta}_E = \arg\min_{\Theta_E} \; \ell\left(E(x; \Theta_E), s\right) \qquad (2)$$
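To make this concrete, below is a minimal NumPy sketch of one common choice of $\ell$: the pixel-wise cross-entropy between the estimator's three-channel probability map and a one-hot manual segmentation. The function name and array shapes are our own illustration, not taken from the paper.

```python
import numpy as np

def pixel_cross_entropy(probs, target, eps=1e-7):
    """Mean pixel-wise cross-entropy between estimated class probabilities
    (H, W, 3) and a one-hot target segmentation of the same shape."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(target * np.log(probs), axis=-1))
```

A perfect prediction yields a loss of zero, while a uniform prediction over the three classes yields $\log 3$ per pixel.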
2.3 Adversarial Networks
Unlike the GAN, aiming to generate examples from an unknown distribution, we aim to estimate the variables of an unknown conditional distribution $p(S \mid X)$. Defining the loss either in a supervised pixel-based way, e.g. by a pixel-wise norm, or in an unsupervised global method, by a cost functional that constrains the partition into homogeneous regions while minimizing the length of their boundaries, is usually not well defined. We therefore define the loss by pairing our estimator with a discriminator. Let $E$ and $D$ denote the estimator and discriminator respectively, both implemented as CNNs with parameters $\Theta_E$ and $\Theta_D$ respectively. The estimator aims to find the best estimation $\hat{s} = E(x; \Theta_E)$ of the partitioning $s$ given the image $x$. The discriminator, on the other hand, tries to distinguish between $s$ and $\hat{s}$: given pairs of either $(x, s)$ or $(x, \hat{s})$, it outputs the probability that the input segmentation is manual rather than estimated, denoted as $D(x, \cdot\,; \Theta_D)$. As in the GAN case, the objectives of the estimator and the discriminator are exactly opposing, and so are the losses for training $\Theta_E$ and $\Theta_D$. We train $D$ to maximize the probability of assigning the correct label to both manual examples and examples estimated by $E$. We simultaneously train $E$ to minimize the same probability, essentially trying to make $s$ and $\hat{s}$ as similar as possible:

$$\hat{\Theta}_D = \arg\min_{\Theta_D} \; -\mathbb{E}_{(x,s) \sim p(X,S)}\left[\log D(x, s; \Theta_D)\right] - \mathbb{E}_{x \sim p(X)}\left[\log\left(1 - D\left(x, E(x; \Theta_E); \Theta_D\right)\right)\right] \qquad (3)$$

$$\hat{\Theta}_E = \arg\min_{\Theta_E} \; \mathbb{E}_{x \sim p(X)}\left[\log\left(1 - D\left(x, E(x; \Theta_E); \Theta_D\right)\right)\right] \qquad (4)$$

In other words, $E$ and $D$ are players in a min-max game with the value function:

$$\min_{E} \max_{D} V(D, E) = \mathbb{E}_{(x,s) \sim p(X,S)}\left[\log D(x, s)\right] + \mathbb{E}_{x \sim p(X)}\left[\log\left(1 - D\left(x, E(x)\right)\right)\right] \qquad (5)$$

The equilibrium is achieved when $s$ and $\hat{s}$ are similar such that the discriminator cannot distinguish between the pairs $(x, s)$ and $(x, \hat{s})$.
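The two opposing losses can be sketched as follows, assuming `d_real` and `d_fake` hold the discriminator's outputs on manual pairs $(x, s)$ and estimated pairs $(x, \hat{s})$ respectively (the function and variable names are our own, and the clipping constant is a standard numerical safeguard, not a detail from the paper):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Binary cross-entropy over manual (real) and estimated (fake) pairs.
    The discriminator minimizes this quantity."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def estimator_loss(d_fake, eps=1e-7):
    """The estimator minimizes log(1 - D(x, E(x))), i.e. it maximizes the
    discriminator's loss on estimated pairs."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(1.0 - d_fake))
```

Note that the estimator's loss decreases as the discriminator is fooled more often, which is exactly the opposition expressed by the min-max game above.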
2.4 Implementation Details
2.4.1 Estimator Network Architecture
The estimator net is designed as a five-layer fully convolutional network; each layer is constructed of a convolution followed by batch normalization and a leaky ReLU activation. The output of the estimator is an image of the same size as the input image with three channels corresponding to the probability that a pixel belongs to the background, the foreground or the cell contour.
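The three-channel probability output can be obtained by a pixel-wise softmax over the final layer's logits. A minimal sketch (the function name and shapes are our own illustration):

```python
import numpy as np

def class_probabilities(logits):
    """Pixel-wise softmax over the last axis of an (H, W, 3) logit map,
    yielding background / foreground / contour probabilities per pixel."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```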
2.4.2 Discriminator Network Architecture
The discriminator is designed with a more complex structure. The discriminator's task is to distinguish between manual and estimated segmentation images given a specific gray-level (GL) image. The question arises of how to design a discriminator architecture that can accept both the GL and segmentation images as input. A basic design is that of a classification CNN where both images are concatenated along the channel axis, as done in [17]. However, we believe that this approach is not optimal for our needs since, for this task, the discriminator should be able to match high-level features from the GL and segmentation images. Yet these features may have very different appearances; for example, an edge of a cell in the GL image appears as a transition from white to black while the same edge in the segmentation image appears as a thin blue line. This difference requires the network to learn individual filters for each semantic region; finding correlations between the two then becomes a more feasible task. For these reasons we designed a specific architecture, referred to as a "Rib Cage" architecture, which has three channels. The first and second channels get inputs from the GL channel and the segmentation channel respectively, and each calculates feature maps using a convolutional layer; we refer to these channels as the "Ribs". The third channel, referred to as the "Spine", gets a concatenation of inputs from both the GL and segmentation channels and matches feature maps (i.e., correlations). See Figure 1 for an illustration of the "Rib Cage" block. The discriminator is designed as three consecutive "Rib Cage" blocks followed by two fully connected (FC) layers with leaky ReLU activations and a final FC layer with one output and a sigmoid activation for classification. Figure 2 illustrates the discriminator design. The spine of each block uses half the number of filters of its ribs.
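As an illustrative sketch only (not the paper's implementation: real blocks use spatial convolutions with learned weights, whereas here 1x1 convolutions with random weights stand in), the channel bookkeeping of one "Rib Cage" block can be written as:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def rib_cage_block(gl, seg, n_filters, rng):
    """One 'Rib Cage' block sketched with 1x1 convolutions (per-pixel
    linear maps). gl and seg are (H, W, C) feature maps from the gray-level
    and segmentation streams; the spine sees their concatenation and uses
    half the filters of the ribs, as stated in the paper."""
    w_gl = rng.standard_normal((gl.shape[-1], n_filters))
    w_seg = rng.standard_normal((seg.shape[-1], n_filters))
    w_spine = rng.standard_normal((gl.shape[-1] + seg.shape[-1], n_filters // 2))
    rib_gl = leaky_relu(gl @ w_gl)                                   # GL rib
    rib_seg = leaky_relu(seg @ w_seg)                                # seg rib
    spine = leaky_relu(np.concatenate([gl, seg], axis=-1) @ w_spine) # spine
    return rib_gl, rib_seg, spine
```

Stacking three such blocks, with each rib feeding the corresponding rib of the next block, reproduces the overall layout described above.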
2.4.3 Data
We trained the networks on the H1299 data set [20]; each frame captures approximately 50 cells. Manual annotation of randomly selected frames was done by an expert. The annotated set was split into a training set and a validation set. The training set was subsampled to $N$ examples for training, which were augmented using randomly cropped areas along with random flips and random rotations. The images were annotated using three labels for the background (red), cell nucleus (green) and nucleus contour (blue), encoded as RGB images.
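The augmentation pipeline can be sketched as follows; the crop size parameter and function names are our own illustration (the paper's exact crop size was lost in extraction), and the same random transform is applied to the image and its RGB label map:

```python
import numpy as np

def augment(image, label, crop, rng):
    """Random crop + random horizontal flip + random 90-degree rotation,
    applied identically to a gray-level image and its RGB label map."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    img = image[top:top + crop, left:left + crop]
    lbl = label[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # random flip
        img, lbl = np.flip(img, axis=1), np.flip(lbl, axis=1)
    k = int(rng.integers(0, 4))            # random rotation by k * 90 degrees
    img, lbl = np.rot90(img, k), np.rot90(lbl, k)
    return img, lbl
```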
3 Experiments and Results
Figure 3: Segmentation examples of a validation image given different numbers of training examples. The odd and even rows show the full image and a zoomed area respectively. Notice that in all cases the cells in the second row were correctly separated even though they are very close together. The bottom right shows the result when training with the CE loss.
Table 1: Quantitative results of the individual cell segmentation: precision (Prec), recall (Rec), F-measure (F) and Jaccard index (J). Adv-N denotes adversarial training with N annotated examples; CE-11 denotes training with the CE loss on 11 examples.

        Adv-1    Adv-2    Adv-4    Adv-11   CE-11    Class-Disc   Ilastik
Prec    89.9%    85.4%    86.8%    85.8%    83.6%    78.7%        81.2%
Rec     82%      87.2%    86.8%    86.5%    86.4%    81.14%       80.2%
F       85.8%    86.3%    86.8%    86.1%    84.8%    79.9%        80.7%
J       80.6%    75.8%    77.4%    74.6%    72.1%    60.2%        68.4%
We conducted four experiments, training the networks with different values of $N$, the number of annotated training examples. All other parameters were set identically. We evaluated the segmentation using the measures reported in Table 1. We compared the adversarial training regime to the common CE loss, training only the estimator. We furthermore evaluated our choice of Rib Cage discriminator against a classification architecture (VGG16 [21]). We also compared our results to a state-of-the-art segmentation tool, Ilastik [8]. The manual annotations were done by an expert. The quantitative results of the individual cell segmentation are detailed in Table 1. Note that the number of images in the training data had little effect on the results. Figure 3 shows an example of a segmented frame. It is clear that the networks learned a few distinct properties of the segmentation. First, each cell is encircled by a thin blue line. Second, the shape of the contour follows the true shape of the cell. Some drawbacks are still seen where two cells completely touch and the boundary is difficult to determine.
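As an illustration only (the paper does not specify its cell-matching procedure, and its J may be a pixel-overlap Jaccard rather than the set-level one used here), detection-style measures of this kind can be computed from counts of correctly matched cells (tp), spurious cells (fp) and missed cells (fn):

```python
def seg_metrics(tp, fp, fn):
    """Precision, recall, F-measure and set-level Jaccard index from
    matched-cell counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f = 2 * prec * rec / (prec + rec)
    jac = tp / (tp + fp + fn)
    return prec, rec, f, jac
```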
4 Summary
In this work we propose a new concept for microscopy cell segmentation using a CNN with an adversarial loss. The contribution of such an approach is twofold. First, the loss function is automatically defined as it is learned alongside the estimator, making this a simple-to-use algorithm with no tuning necessary. Second, we show that this method is robust to a low number of training examples.
The quantitative results, as well as the visual results, clearly show that both the estimator and our unique "Rib Cage" discriminator learn global and local properties of the segmentation, i.e., the shape of the cell, the contour surrounding the cell, and the fitting of segmentation edges to cell edges. These properties could not be learned using only a pixel-based CE loss, as is commonly done.
References
 [1] T. Kanade, et al., “Cell image analysis: Algorithms, system and applications,” in WACV. IEEE, 2011, pp. 374–381.
 [2] L. Vincent and P. Soille, “Watersheds in digital spaces: an efficient algorithm based on immersion simulations,” IEEE PAMI, vol. 13, no. 6, pp. 583–598, 1991.
 [3] P. Bamford and B. Lovell, “Unsupervised cell nucleus segmentation with active contours,” Signal Processing, vol. 71, no. 2, pp. 203–213, 1998.
 [4] E. Meijering, et al., “Methods for cell and particle tracking,” Methods Enzymol, vol. 504, no. 9, pp. 183–200, 2012.
 [5] M. Schiegg, et al., “Graphical model for joint segmentation and tracking of multiple dividing cells,” Bioinformatics, vol. 31, no. 6, pp. 948–956, 2014.
 [6] A. Arbelle, et al., “Analysis of high throughput microscopy videos: Catching up with cell dynamics,” in MICCAI 2015, pp. 218–225. Springer, 2015.
 [7] H. Su, et al., “Cell segmentation in phase contrast microscopy images via semi-supervised classification over optics-related features,” MEDIA, vol. 17, no. 7, pp. 746–765, 2013.
 [8] C. Sommer, et al., “Ilastik: Interactive learning and segmentation toolkit,” in IEEE ISBI 2011, March 2011, pp. 230–233.

 [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
 [10] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE CVPR, 2015, pp. 3431–3440.
 [11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” arXiv preprint arXiv:1505.04597, 2015.
 [12] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
 [13] I. Goodfellow, et al., “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
 [14] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [15] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
 [16] D. J. Im, et al., “Generating images with recurrent adversarial networks,” arXiv preprint arXiv:1602.05110, 2016.
 [17] P. Isola, et al., “Imagetoimage translation with conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.
 [18] P. Luc, et al., “Semantic segmentation using adversarial networks,” arXiv preprint arXiv:1611.08408, 2016.

 [19] K. Sadanandan, et al., “Spheroid segmentation using multiscale deep adversarial networks,” in Proceedings of the IEEE CVPR, 2017, pp. 36–41.
 [20] A. A. Cohen, et al., “Dynamic proteomics of individual cancer cells in response to a drug,” Science, vol. 322, no. 5907, pp. 1511–1516, 2008.
 [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.