Image segmentation is a well-studied problem, playing a major role in almost any image analysis task. Model-based as well as data-driven approaches are usually validated on ‘previously unseen’, annotated test sets, and compared to benchmarks and current state-of-the-art segmentation methods adapted to the task at hand, the imaging modality, and the depicted scene or objects. Nevertheless, despite the countless publicly (and commercially) available toolboxes and source codes, some of which obtain almost ‘human-level’ scores on well-known challenges, image segmentation is not considered a solved problem.
Consider, for example, live cell microscopy image segmentation. Instance segmentation of individual cells allows the extraction of useful information at the cellular level - such as cells’ internal and external structure, protein level, dynamics, signalling, cycle and more. Further analysis of this output may shed light on biological processes and phenomena - potentially making a significant impact on healthcare research. The implications of some of these biological findings are critical, thus every step in the research process, in particular cell segmentation, must be reliable. This raises the question whether a user can be assured that a given segmentation method, even one shown to provide accurate results on a comprehensive set of test examples, will perform well enough on different (though possibly similar) data. Moreover, since current methods are ranked based on the statistics of the measured score, how could a user detect specific cases of segmentation failure which may put the overall biological analysis at risk? In other words, can we avoid visual inspection of the results, or an additional test with data-specific manual annotations, in this and similar important pipelines?
This paper is, to the best of our knowledge, the first to address the quantitative assessment of image segmentation without ground-truth annotations or any other information beyond the image and its corresponding proposed segmentation. Specifically, we propose a deep neural network, termed Quality Assurance Network (QANet), that is able to estimate the intersection-over-union (IoU) scores of image-segmentation pairs.
Some computer vision methods, not necessarily for image segmentation, produce confidence scores alongside their output. YOLO [6, 7, 8], for example, designed for object detection, predicts both a bounding box and its estimated IoU. Nevertheless, there are essential differences between the QANet and the YOLO confidence score. Not only is a bounding box a very crude approximation of an instance segmentation, but the QANet scores the outputs of other methods, whereas YOLO’s scoring can be applied only to its own outputs. When considering networks that score the outputs of other networks, the discriminator in an adversarial framework [3] comes to mind. However, we should note that the goal of the discriminator is not to regress a confidence score, but only to perform as an implicit loss function for its adversarial network. It is usually trained for binary classification of "real" versus "fake" examples, and once training is done, the discriminator is discarded and does not produce informative outputs.
The QANet is based on a unique architecture, called the RibCage Network, first introduced in [1]. The structure of two ‘ribs’ connected to a spine allows a multi-level, spatially sensitive comparison of its inputs: a gray-level image and its proposed segmentation, represented as a ternary (foreground-background-cell contour) image. The network is trained to estimate the IoU score between the proposed segmentation and the unknown ground-truth segmentation of the input image. It should be stressed that the QANet does not aim to estimate the ground truth segmentation and compare it with the proposed one. Instead, it solves a regression problem, providing a scalar between zero and one which can be seen as a quality assessment score of the proposed segmentation.
While the QANet’s architecture and training regime can be applied to any type of data, we chose to validate it on live cell microscopy images. Specifically, we used fluorescence microscopy data sets from the Cell Segmentation Benchmark (http://celltrackingchallenge.net/latest-csb-results/). Considering the segmentation of individual cells, the QANet’s output is an estimate of the average IoU score calculated for each connected component comprising the proposed segmentation, as in the benchmark. We trained the network on simulated data, Fluo-N2DH-SIM+, and tested it on the segmentation results of KTH-SE(1) [4] and CVUT-CZ, which were among the top three methods in the challenge, on the Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa test datasets. Very promising results are demonstrated.
Let $I$ denote a gray-level image and let $\Gamma$ denote the corresponding ground truth (GT) segmentation, with labels $\{0, 1, 2\}$ corresponding to the background, foreground and cell contour, respectively. Each connected component of the foreground represents a single cell. Let $S$ be a proposed segmentation of $I$. The quality of the proposed segmentation can be calculated using the SEG measure, which is explained in the following section. We denote the true SEG measure for a pair of GT and proposed segmentations by $\mathrm{SEG}(\Gamma, S)$. Our goal is to estimate the SEG measure given only the raw image and the proposed segmentation, denoted by $\widehat{\mathrm{SEG}}(I, S)$.
2.2 The SEG Measure
Let $N_\Gamma$ and $N_S$ be the number of individual cells in the GT and proposed segmentations, $\Gamma$ and $S$, respectively. Let $\gamma_i \subset \Gamma$ and $s_j \subset S$ define corresponding objects in the GT and proposed segmentations, respectively. The SEG measure introduced in [5] is defined as the IoU of the GT and proposed objects, unless their overlap is lower than 50%. The mean SEG measure over all ground truth objects is formulated as follows:
$$\mathrm{SEG}(\Gamma, S) = \frac{1}{N_\Gamma}\sum_{i=1}^{N_\Gamma} J\left(\gamma_i, s_{j(i)}\right),$$
where
$$J(\gamma_i, s_j) = \begin{cases} \dfrac{|\gamma_i \cap s_j|}{|\gamma_i \cup s_j|}, & |\gamma_i \cap s_j| > 0.5\,|\gamma_i|,\\ 0, & \text{otherwise,} \end{cases}$$
and $j(i)$ is the index of the object in $S$ matching $\gamma_i$. We note that for every object $\gamma_i \in \Gamma$ there exists at most one object $s_j \in S$ with overlap greater than 50%. If there is no such object, the cell is considered undetected and its SEG score is set to zero.
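For concreteness, the SEG computation can be sketched in NumPy over labeled instance masks; this is a minimal illustration, not the benchmark's official implementation, and the function name and matching loop are ours:

```python
import numpy as np

def seg_measure(gt_labels, pred_labels):
    """Mean SEG score over GT objects: IoU with the matched predicted
    object if that object covers more than half of the GT object, else 0."""
    scores = []
    for g in np.unique(gt_labels):
        if g == 0:  # skip background
            continue
        gt_mask = gt_labels == g
        # Candidate match: the predicted label covering most of this GT object.
        labels, counts = np.unique(pred_labels[gt_mask], return_counts=True)
        best = labels[np.argmax(counts)]
        if best == 0 or counts.max() <= 0.5 * gt_mask.sum():
            scores.append(0.0)  # undetected cell
        else:
            pred_mask = pred_labels == best
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            scores.append(inter / union)
    return float(np.mean(scores)) if scores else 0.0
```

Note that, because a valid match must cover more than half of the GT object, at most one predicted label can qualify, so taking the largest-overlap label suffices.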
2.3 RibCage Network Architecture and Loss
In order to implement the estimator, we chose the RibCage Network, originally proposed in [1] as a discriminator in an adversarial setting. The strength of the RibCage Network is its ability to extract and compare low- and high-level features from two images in a spatially sensitive manner at multiple image scales. The RibCage Network is composed of blocks called rib-cages, each of which gets three inputs: a left rib, a right rib and a spine. Each of the ribs is passed through a convolutional layer, batch normalization and a ReLU activation. The spine is concatenated (along the channel axis) with the two ribs and passed through a convolutional layer, batch normalization and a ReLU activation. The three outputs are passed to the next block (see Figure 1 for an outline of the network). Refer to our publicly available code for technical implementation details.
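A single rib-cage block, as described above, might look as follows in PyTorch. This is a sketch, not the released implementation: the channel counts, the 3x3 kernel size, and the choice to concatenate the spine with the rib outputs (rather than the rib inputs) are our assumptions:

```python
import torch
import torch.nn as nn

class RibCageBlock(nn.Module):
    """One rib-cage block: two ribs and a spine, each passed through
    conv -> batch norm -> ReLU; the spine also sees both rib outputs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def conv_bn_relu(c_in):
            return nn.Sequential(
                nn.Conv2d(c_in, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU())
        self.left = conv_bn_relu(in_ch)    # image-side rib
        self.right = conv_bn_relu(in_ch)   # segmentation-side rib
        # Spine input: previous spine concatenated with both rib outputs.
        self.spine = conv_bn_relu(in_ch + 2 * out_ch)

    def forward(self, left, right, spine):
        l = self.left(left)
        r = self.right(right)
        s = self.spine(torch.cat([spine, l, r], dim=1))
        return l, r, s  # all three are fed to the next block
```

Stacking several such blocks, followed by fully connected layers that map the final spine features to a single scalar, yields the regression head described below.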
We denote the network with parameters $\theta$ by $f_\theta$. The QANet is trained to regress the values of the SEG measure. The training loss, $\mathcal{L}$, is the mean squared error (MSE) between the network's output and the true SEG value:
$$\mathcal{L} = \left( f_\theta(I, S) - \mathrm{SEG}(\Gamma, S) \right)^2.$$
3.1 Evaluation Measure
The MSE between the GT SEG and the predicted SEG is the most straightforward measure for evaluating the QANet. However, to demonstrate the QANet's performance across different SEG values, we used a scatter plot of the GT SEG, $\mathrm{SEG}(\Gamma, S)$, against the predicted SEG, $\widehat{\mathrm{SEG}}(I, S)$, for the test examples. In addition, we calculated the Hit Rate, i.e., the fraction of test examples for which the difference between the GT SEG and the predicted SEG falls within a specified tolerance. We then used the Area Under the Curve (AUC) of the Hit Rate versus tolerance plot as an additional evaluation measure.
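The Hit Rate curve and its AUC can be sketched as follows; the tolerance grid and the normalization by the tolerance range are our assumptions, as the exact settings are not specified here:

```python
import numpy as np

def hit_rate_curve(seg_true, seg_pred, tolerances):
    """Fraction of examples whose absolute prediction error is within
    each tolerance value."""
    err = np.abs(np.asarray(seg_true) - np.asarray(seg_pred))
    return np.array([(err <= t).mean() for t in tolerances])

def hit_rate_auc(seg_true, seg_pred, tolerances=np.linspace(0, 1, 101)):
    """Area under the Hit Rate vs. tolerance curve (trapezoidal rule),
    normalized by the tolerance range so a perfect predictor scores 1.0."""
    hr = hit_rate_curve(seg_true, seg_pred, tolerances)
    area = np.sum((hr[1:] + hr[:-1]) / 2.0 * np.diff(tolerances))
    return area / (tolerances[-1] - tolerances[0])
```

A predictor with zero error hits every tolerance bin, so its curve is identically one and its normalized AUC is one; larger errors push the curve's rise to the right and shrink the area.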
3.2 Simulated and ‘Real’ Data
The QANet’s input consists of pairs of cell images and their proposed segmentations. The GT target score is the true SEG score calculated from the GT segmentation. We used three datasets of the Cell Segmentation Benchmark: Fluo-N2DH-SIM+, Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa. Simulated segmentation predictions were generated by applying elastic transformations to the GT segmentations of the Fluo-N2DH-SIM+ training sets. ‘Real’ segmentation predictions include the outputs of two state-of-the-art deep-learning-based cell segmentation methods, BGU-IL(3) [2] and CVUT-CZ, and a state-of-the-art classical cell segmentation method, KTH-SE(1) [4], which we downloaded from the Cell Segmentation Benchmark website.
3.3 Training and Validation
The training data for the QANet is entirely synthetic. The raw images and GT segmentations, $I$ and $\Gamma$ respectively, were taken from the training set of the Fluo-N2DH-SIM+ dataset, which comprises two simulated fluorescence sequences with full GT annotations.
In order to generate examples with a wide range of IoU values, segmentation predictions were obtained by applying random elastic deformations to the GT segmentations. The elastic transformation is the same as the one implemented for data augmentation in [9].
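Such a label-preserving elastic deformation can be sketched with SciPy; the displacement parameters `alpha` and `sigma` below are illustrative values of our choosing, not the ones used in training:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_labels(labels, alpha=15.0, sigma=3.0, rng=None):
    """Randomly warp a label image to mimic an imperfect segmentation.
    alpha scales the displacement field, sigma smooths it."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = labels.shape
    # Smooth, random per-pixel displacement fields.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([y + dy, x + dx])
    # Nearest-neighbour interpolation keeps the output a valid label image.
    return map_coordinates(labels, coords, order=0, mode='nearest')
```

Varying the deformation strength produces predictions whose SEG scores against the original GT span the whole range from near one (mild warps) down to zero (warps that break the 50% overlap rule).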
Validation: The dataset was split into training, validation and test sets. Figure 2 displays some example images from the test set along with their estimated SEG measures. Figure 3 (left) shows the scatter plot of the estimated versus the true SEG measure for the validation and test sets. Figure 3 (right) displays the Hit Rate as a function of the IoU tolerance. The AUC measures for the test and validation sets are 0.871 and 0.895, respectively; the MSE measures are 0.796 and 0.848, respectively. A Hit Rate of 0.18 was obtained for an IoU tolerance of 18% (the knee of the Hit Rate plot).
3.4 Network Architecture and Input Format
We compare three alternative architectures for the QANet: the RibCage Network, a Siamese network and a classification network. In our experiments, all layers have the same number of features, and all three networks end with three fully connected (FC) layers. The networks differ in their first four layers:
The RibCage network is as described above. The Siamese network is composed of two independent streams of four convolutional layers, one receiving $I$ as input and the other $S$. The outputs of the last convolutional layers are concatenated and fed into the FC layers.
The classification network receives a single input: the concatenation, along the channel axis, of the grayscale image $I$ with the proposed segmentation image $S$. It is composed of four convolutional layers followed by the FC layers.
We tested the three alternatives on the outputs of the CVUT-CZ and KTH-SE(1) methods on two datasets: Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa.
| Architecture | Fluo-N2DH-GOWT1, CVUT-CZ | Fluo-N2DH-GOWT1, KTH-SE(1) | Fluo-N2DL-HeLa, CVUT-CZ | Fluo-N2DL-HeLa, KTH-SE(1) |
|---|---|---|---|---|
| RibCage Network (3-class) | 0.904 | 0.947 | 0.912 | 0.944 |
| RibCage Network (2-class) | 0.902 | 0.933 | 0.908 | 0.917 |
Figure 4 displays the Hit Rate curves, and Table 1 the AUC scores, obtained for the three architectures, where the RibCage network was tested with both the 2-class and 3-class input formats. The results demonstrate the advantage of the RibCage network with the 3-class segmentation format.
3.5 Evaluation of State-of-the-art Segmentation Methods
An alternative approach to the QANet could be cross-evaluation between multiple segmentation methods. For example, given two segmentation methods, one could act as a surrogate GT segmentation for the other, and vice versa. While this approach is valid, we show that it is less accurate than the QANet. The prediction capabilities of the QANet were tested on the outputs of BGU-IL(3), CVUT-CZ and KTH-SE(1). Each method was applied to the Fluo-N2DH-SIM+ test set. We note that the ground truth annotations for the test set are unavailable; however, the final scores, as validated by the challenge organizers, are published on the challenge website. We then measured the mean output of the QANet and the cross-method evaluation score. Table 2 shows the true and predicted SEG scores. As demonstrated in the table, the QANet's predicted SEG scores for all tested segmentation methods are more accurate than the scores obtained via the surrogate GT.
| Method | SEG Score | QANet Score | Cross-method Score | Surrogate GT |
|---|---|---|---|---|
| BGU-IL(3) | 0.811 | 0.808 (-0.003) | 0.767 (-0.044) | KTH-SE(1) |
| CVUT-CZ | 0.807 | 0.813 (+0.006) | 0.769 (-0.038) | KTH-SE(1) |
| KTH-SE(1) | 0.791 | 0.799 (+0.007) | 0.772 (-0.019) | CVUT-CZ |
In this paper, we introduced the QANet, a method for scoring the accuracy of any instance segmentation method at the single-image level, without the need for ground truth annotations or manual inspection of the target data. The results presented in Section 3, based on the publicly available Cell Segmentation Benchmark datasets, show the QANet's ability to generalize to different datasets and segmentation methods while being trained only on simulated data.
The QANet does not itself produce the segmentation of the image; rather, it estimates the quality of a given segmentation. Though it is a supervised network, its output can be exploited to facilitate training on user-specific data in an unsupervised manner when ground truth annotations are unavailable. In the same fashion, it can be used to evaluate segmentation predictions in an on-line learning framework.
A possible alternative to the QANet would be to use two segmentation methods, each testing the other. Such a comparison, presented in Section 3.5, demonstrates the superiority of the QANet. Our assumption is that, regardless of the method, classical or machine-learning based, segmentation processes are guided by similar principles; therefore, it is not unlikely that different segmentation methods will fail on similar examples and thus fail to evaluate each other. The QANet, on the other hand, has the advantage of being exposed to both the image and its segmentation, and its only task is grading. Paraphrasing the British statesman Benjamin Disraeli: “…it is easier to be critical than to be correct”.
-  Arbelle, A., Riklin Raviv, T.: Microscopy cell segmentation via adversarial neural networks. IEEE ISBI pp. 645–648 (2018)
-  Arbelle, A., Riklin Raviv, T.: Microscopy cell segmentation via convolutional LSTM networks. arXiv preprint arXiv:1805.11247 (2019)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. pp. 2672–2680 (2014)
-  Magnusson, K.E.G.: Segmentation and tracking of cells and particles in time-lapse microscopy. Ph.D. thesis, KTH Royal Institute of Technology (2016)
-  Maška, M., Ulman, V., Svoboda, D., et al.: A benchmark for comparison of cell tracking algorithms. Bioinformatics 30(11), 1609–1617 (2014)
-  Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. pp. 779–788 (2016)
-  Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR. pp. 7263–7271 (2017)
-  Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597 (2015)
-  Ulman, V., Maška, M., Magnusson, K., et al.: An objective comparison of cell-tracking algorithms. Nature methods 14(12), 1141 (2017)