QANet - Quality Assurance Network for Microscopy Cell Segmentation

by   Assaf Arbelle, et al.

Tools and methods for automatic image segmentation are rapidly developing, each with its own strengths and weaknesses. While these methods are designed to be as general as possible, there are no guarantees for their performance on new data. The choice between methods is usually based on benchmark performance, whereas the data in the benchmark can be significantly different from that of the user. We introduce a novel Deep Learning method which, given an image and a proposed corresponding segmentation, estimates the Intersection over Union measure (IoU) with respect to the unknown ground truth. We refer to this method as a Quality Assurance Network - QANet. The QANet is designed to give the user an estimate of the segmentation quality on the user's own private data without the need for human inspection or labelling. It is based on the RibCage Network architecture, originally proposed in an adversarial network framework. Promising IoU prediction results are demonstrated based on the Cell Segmentation Benchmark. Code is freely available at: TBD




1 Introduction

Image segmentation is a well-studied problem, playing a major role in almost any image analysis task. Model-based as well as data-driven approaches are usually validated on ‘previously unseen’, annotated test sets, and compared to benchmarks and current state-of-the-art segmentation methods adapted to the task examined, the imaging modality and the depicted scene or objects. Nevertheless, despite the countless publicly (and commercially) available toolboxes and source codes, some of which obtain almost ‘human-level’ scores on some well-known challenges, image segmentation is not considered a solved problem.

Consider, for example, live cell microscopy image segmentation. Instance segmentation of individual cells allows the extraction of useful information at the cellular level - such as cells’ internal and external structure, protein level, dynamics, signalling, cycle and more. Further analysis of this output may shed light on biological processes and phenomena - potentially making a significant impact on healthcare research. The implications of some of these biological findings are critical, thus every step in the research process, in particular cell segmentation, must be reliable. This raises the question of whether a user can be assured that a given segmentation method, even one that is shown to provide accurate results on a comprehensive set of test examples, will perform well enough on different (though possibly similar) data. Moreover, since current methods are ranked based on the statistics of the measured score, how could a user detect specific cases of segmentation failure which may put the overall biological analysis at risk? In other words, can we avoid visual inspection of the results, or an additional test with data-specific manual annotations, in this and similar important pipelines?

To the best of our knowledge, this paper is the first to address quantitative assessment of image segmentation without ground-truth annotations or any other information but the image and its corresponding proposed segmentation. Specifically, we propose a deep neural network, termed Quality Assurance Network - QANet, that is able to estimate the intersection-over-union (IoU) scores of image-segmentation pairs.

Some Computer Vision methods, not necessarily for image segmentation, produce confidence scores alongside their output. YOLO [6, 7, 8], for example, which is designed for object detection, predicts both a bounding box and its estimated IoU. Nevertheless, there are essential differences between the QANet and the YOLO confidence score. Not only is a bounding box a very crude estimation of the instance segmentation, but the QANet works on the output of other methods, while YOLO’s scoring can be applied only to its own outputs. When considering networks which score the outputs of other networks, the discriminator in an adversarial framework comes to mind [3]. However, we should note that the goal of the discriminator is not to regress the confidence score, but only to perform as an implicit loss function for its adversarial network. It is usually trained for binary classification of "real" versus "fake" examples, and once the training is done, the discriminator collapses and does not produce informative outputs.

The QANet is based on a unique architecture, called the RibCage Network, that was first introduced in [2]. The structure of two ‘ribs’ connected to a spine allows a multi-level, spatially sensitive comparison of its inputs: a gray-level image and its proposed segmentation, represented as a trinary (foreground-background-cell contour) image. The network is trained to estimate the IoU score between the proposed segmentation and the unknown ground-truth segmentation of the input image. It should be stressed that QANet does not aim to estimate the ground truth segmentation and compare it with the proposed one. Instead, it solves a regression problem, providing a scalar between zero and one which can be seen as a quality assessment score of the proposed segmentation.

While the QANet’s architecture and its training regime can be applied to any type of data, we chose to validate it on live cell microscopy images. Specifically, we used fluorescence microscopy data sets from the Cell Segmentation Benchmark [10]. Considering the segmentation of individual cells, the QANet’s output is an estimate of the average IoU scores calculated for each connected component comprising the proposed segmentation, as in the benchmark. We trained the network on simulated data, Fluo-N2DH-SIM+, and tested it on the segmentation results of KTH-SE(1) [4] and CVUT-CZ, which were among the top three methods in the challenge, on the Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa test datasets. Very promising results are shown.

The rest of the paper is organized as follows: In Section 2 we formulate the QA problem and quality measure; present the RibCage architecture as well as the loss; and discuss the simulation of training examples. Experimental results are presented in Section 3. We conclude in Section 4.

2 Method

2.1 Formulation

Let $I$ be the gray-level image of size $H \times W$ and let $\Gamma_{GT}$ be the corresponding ground truth (GT) segmentation with labels $\{0, 1, 2\}$ corresponding to the background, foreground and cell contour respectively. Each connected component of the foreground represents a single cell. Let $\Gamma$ be a proposed segmentation of $I$. The quality of the proposed segmentation can be calculated using the SEG measure [5], which will be explained in the following section. We denote the true SEG measure for a pair of GT and proposed segmentation as $S(\Gamma_{GT}, \Gamma)$. Our goal is to estimate the SEG measure given only the raw image and the proposed segmentation, denoted as $\hat{S}(I, \Gamma)$.

2.2 The SEG Measure

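Let $K$ and $K'$ be the number of individual cells in the GT and proposed segmentations, $\Gamma_{GT}$ and $\Gamma$ respectively. Let $c'_{k'}$ and $c_k$ define corresponding objects in the proposed and GT segmentations, respectively. The SEG measure introduced in [5] is defined as the IoU of the GT and the proposed objects, unless their overlap is lower than 50%, in which case it is zero. The mean SEG measure over all ground truth objects is formulated as follows:

$$S(\Gamma_{GT}, \Gamma) = \frac{1}{K} \sum_{k=1}^{K} \sum_{k'=1}^{K'} \alpha(c_k, c'_{k'}) \, \mathrm{IoU}(c_k, c'_{k'})$$

where $\alpha(c, c') = 1$ if $|c \cap c'| > 0.5\,|c|$ and $0$ otherwise, and $\mathrm{IoU}(c, c') = |c \cap c'| / |c \cup c'|$. We note that for every object $c_k$ in $\Gamma_{GT}$ there exists at most one object $c'_{k'}$ in $\Gamma$ with overlap greater than 50%. If there is no such object, the cell is considered undetected and its SEG score is set to zero.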
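
The definition above can be illustrated with a minimal NumPy sketch. It assumes, for illustration only, that both segmentations are given as integer instance label maps (0 for background, one positive integer per cell); the function name `seg_measure` and the convention of returning zero when there are no GT objects are assumptions, not part of the benchmark's reference implementation.

```python
import numpy as np


def seg_measure(gt_labels: np.ndarray, pred_labels: np.ndarray) -> float:
    """Mean SEG score over all ground-truth objects.

    Both inputs are integer label maps of the same shape:
    0 is background, each positive integer marks one cell.
    """
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    if not gt_ids:
        return 0.0  # assumption: no GT objects, score defined as zero

    scores = []
    for gt_id in gt_ids:
        gt_mask = gt_labels == gt_id
        gt_size = gt_mask.sum()
        # Proposed labels that overlap this GT object, with overlap sizes.
        ids, counts = np.unique(pred_labels[gt_mask], return_counts=True)
        score = 0.0  # undetected unless a match is found
        for pred_id, inter in zip(ids, counts):
            if pred_id == 0:
                continue
            # A match must cover more than half of the GT object,
            # so at most one proposed object can qualify.
            if inter > 0.5 * gt_size:
                union = gt_size + (pred_labels == pred_id).sum() - inter
                score = inter / union
                break
        scores.append(score)
    return float(np.mean(scores))


# Toy example: one well-matched cell and one nearly missed cell.
gt = np.zeros((6, 6), dtype=int)
gt[0:2, 0:2] = 1
gt[3:6, 3:6] = 2
pred = np.zeros((6, 6), dtype=int)
pred[0:2, 0:3] = 1   # covers all of GT cell 1 plus two extra pixels
pred[5, 5] = 2       # covers only one of nine pixels of GT cell 2
example_score = seg_measure(gt, pred)
```

In the toy example the first cell contributes an IoU of 4/6 and the second is treated as undetected, so the mean over the two GT objects is 1/3.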

2.3 RibCage Network Architecture and Loss

Figure 1: Network Architecture: The QANet is designed in the form of a RibCage Network. The top rib processes the raw gray-level image while the bottom rib processes the segmentation proposal. The spine processes the concatenation of the two ribs throughout the depth of the network. The four RibCage blocks are followed by three FC layers which output a single scalar representing the estimated measure.

In order to implement the estimator we chose a RibCage Network. The RibCage Network was originally proposed by [1] as a discriminator in an adversarial setting. The strength of the RibCage Network is its ability to extract and compare low- and high-level features from two images in a spatially sensitive manner on multiple image scales. The RibCage Network is built from a component called a rib-cage block, which gets three inputs: left rib, right rib and spine. Each of the ribs is passed through a convolutional layer, batch normalization and a ReLU activation. The spine is concatenated (on the channel axis) with the two ribs and passed through a convolutional layer, batch normalization and a ReLU activation. The three outputs are passed to the next block (see Figure 1 for an outline of the network). Refer to our publicly available code for technical implementation details.
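
The data flow through a rib-cage block can be sketched in a few lines of NumPy. This is only an illustration under several assumptions that are not taken from the paper: 1×1 convolutions stand in for the real convolutional layers, a per-channel normalization stands in for batch normalization, there is no downsampling, the first block has no incoming spine, and eight output channels are used everywhere. The FC head is omitted, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv_bn_relu(x, weight):
    """1x1 convolution, per-channel normalization and ReLU on a (C, H, W) array."""
    y = np.einsum("oc,chw->ohw", weight, x)            # pointwise convolution
    mean = y.mean(axis=(1, 2), keepdims=True)
    std = y.std(axis=(1, 2), keepdims=True)
    y = (y - mean) / (std + 1e-5)                      # stand-in for batch norm
    return np.maximum(y, 0.0)                          # ReLU


def ribcage_block(left, right, spine, w_left, w_right, w_spine):
    """One rib-cage block: returns the new (left, right, spine) feature maps."""
    parts = [left, right] if spine is None else [left, right, spine]
    spine_in = np.concatenate(parts, axis=0)           # channel-axis concatenation
    return (conv_bn_relu(left, w_left),
            conv_bn_relu(right, w_right),
            conv_bn_relu(spine_in, w_spine))


def init_weights(c_left, c_right, c_spine, c_out):
    """Random weights for the three convolutions of one block."""
    return (rng.normal(size=(c_out, c_left)),
            rng.normal(size=(c_out, c_right)),
            rng.normal(size=(c_out, c_left + c_right + c_spine)))


# Toy inputs: a gray-level image and a single-channel 3-class segmentation map.
image = rng.random((1, 16, 16))
segmentation = rng.integers(0, 3, size=(1, 16, 16)).astype(float)

left, right, spine = image, segmentation, None
channels = (1, 1, 0)                                   # (left, right, spine) input channels
for _ in range(4):                                     # four rib-cage blocks
    weights = init_weights(*channels, c_out=8)
    left, right, spine = ribcage_block(left, right, spine, *weights)
    channels = (8, 8, 8)
```

The point of the sketch is the wiring: each rib only ever sees its own input, while the spine sees both ribs at every depth, which is what allows the comparison at multiple feature levels.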

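We denote the network with parameters $\theta$ as $\hat{S}_\theta(I, \Gamma)$, where $I$ is the raw image and $\Gamma$ is the proposed segmentation. The QANet is trained to regress the values of the SEG measure. The training loss, $\mathcal{L}$, is the mean squared error (MSE) between the network's output and the true SEG value $S(\Gamma_{GT}, \Gamma)$:

$$\mathcal{L}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \left( \hat{S}_\theta(I_i, \Gamma_i) - S(\Gamma_{GT,i}, \Gamma_i) \right)^2$$

where the sum runs over the $B$ training examples.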

3 Experiments

3.1 Evaluation Measure

The MSE between the GT SEG and the predicted SEG is the most straightforward measure for evaluating the QANet. However, to demonstrate the QANet's performance for different SEG values we used a scatter plot showing the GT SEG, $S$, with respect to the predicted SEG, $\hat{S}$, for the test examples. In addition, we calculated the Hit Rate, which is the fraction of test examples for which the difference between the GT SEG and the predicted SEG is within a specified tolerance. We then used the Area Under the Curve (AUC) of the Hit Rate versus tolerance plot as another measure of evaluation.
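
A minimal sketch of the Hit Rate and its AUC is given below. It assumes that the difference is taken in absolute value and that the tolerance axis spans the interval from 0 to 1 in 101 evenly spaced steps; neither choice is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def hit_rate_curve(true_seg, pred_seg, tolerances):
    """Fraction of examples whose absolute error is within each tolerance."""
    err = np.abs(np.asarray(true_seg, dtype=float) - np.asarray(pred_seg, dtype=float))
    return np.array([(err <= t).mean() for t in tolerances])


def hit_rate_auc(true_seg, pred_seg, num_points=101):
    """Area under the Hit Rate versus tolerance curve (trapezoidal rule)."""
    tol = np.linspace(0.0, 1.0, num_points)            # assumed tolerance range
    rates = hit_rate_curve(true_seg, pred_seg, tol)
    return float(np.sum((rates[1:] + rates[:-1]) * np.diff(tol)) / 2.0)


# Toy example with four test images.
true_scores = [0.8, 0.6, 0.9, 0.5]
pred_scores = [0.75, 0.7, 0.9, 0.1]
rates = hit_rate_curve(true_scores, pred_scores, [0.02, 0.2, 0.5])
auc = hit_rate_auc(true_scores, pred_scores)
```

With these toy values the absolute errors are roughly 0.05, 0.1, 0 and 0.4, so one, three and four of the four examples fall within tolerances of 0.02, 0.2 and 0.5 respectively. A perfect estimator would reach an AUC of 1.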

3.2 Simulated and ‘Real’ Data

The QANet’s input includes pairs of cell images and their proposed segmentation predictions. The target is the true SEG score calculated based on the GT segmentation. We used three datasets of the Cell Segmentation Benchmark: Fluo-N2DH-SIM+, Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa. Simulated segmentation predictions were generated by elastic transformations of the GT segmentations of the Fluo-N2DH-SIM+ training sets. ‘Real’ segmentation predictions include the outputs of two state-of-the-art Deep Learning based cell segmentation methods, BGU-IL(3) [2] and CVUT-CZ, and of a state-of-the-art classical cell segmentation method, KTH-SE(1) [4], which we downloaded from the Cell Segmentation Benchmark website.

In our experiments we tested two possible representations of the segmentation: the binary, foreground-background segmentation, as is done in [9], and the three-class, foreground-background-edge segmentation, as is described in [1, 2]. The QANet was adapted to account for both representations.
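
The two representations can be derived from an instance label map as sketched below. The contour rule used here (a foreground pixel is a contour pixel if any of its four neighbours carries a different label) is an assumption made for illustration; the exact contour definition used in [1, 2] is not given in the text, and the function names are hypothetical.

```python
import numpy as np


def to_binary(labels: np.ndarray) -> np.ndarray:
    """Two-class map: 0 = background, 1 = foreground."""
    return (labels > 0).astype(np.uint8)


def to_three_class(labels: np.ndarray) -> np.ndarray:
    """Three-class map: 0 = background, 1 = foreground, 2 = cell contour."""
    h, w = labels.shape
    padded = np.pad(labels, 1, mode="edge")            # image border is not a contour by itself
    center = padded[1:-1, 1:-1]
    contour = np.zeros((h, w), dtype=bool)
    # Compare each pixel with its up, down, left and right neighbours.
    for dy, dx in ((0, 1), (2, 1), (1, 0), (1, 2)):
        neighbour = padded[dy:dy + h, dx:dx + w]
        contour |= neighbour != center
    out = to_binary(labels)
    out[contour & (labels > 0)] = 2
    return out


# Toy example: a single 3x3 cell in a 5x5 image.
labels = np.zeros((5, 5), dtype=int)
labels[1:4, 1:4] = 1
three_class = to_three_class(labels)
```

For the toy cell, the eight outer pixels become contour and only the centre pixel remains plain foreground. The contour class keeps touching cells separable, which a binary map cannot do.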

3.3 Training and Validation

Figure 2: Examples of images from the validation set. The green and red contours mark the GT and simulated segmentation predictions, respectively. Below each image are the true SEG measure, $S$, and the value estimated by the QANet, $\hat{S}$, left and right respectively.

Data Synthesis: The training data for the QANet is entirely synthetic. The raw images and GT segmentations, $I$ and $\Gamma_{GT}$ respectively, were taken from the training set of the Fluo-N2DH-SIM+ dataset, which comprises two simulated fluorescent sequences with full GT annotations. In order to generate examples with a wide range of IoU, segmentation predictions were obtained by applying random elastic deformations to the GT segmentations. The elastic transformation is the same as was implemented for data augmentation in [9].

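
A simplified sketch of such a deformation is shown below. It is not the implementation used for the paper: random displacements are drawn on a coarse grid and upsampled to full resolution, but bilinear upsampling and nearest-neighbour lookup are used here for brevity, and the 3 by 3 grid and 10-pixel standard deviation are assumed defaults rather than values confirmed by the text. The function names are hypothetical.

```python
import numpy as np


def upsample_bilinear(coarse, height, width):
    """Bilinearly resize a small 2-D grid to (height, width)."""
    gh, gw = coarse.shape
    ys = np.linspace(0, gh - 1, height)
    xs = np.linspace(0, gw - 1, width)
    # Interpolate along x for every coarse row, then along y for every column.
    rows = np.stack([np.interp(xs, np.arange(gw), row) for row in coarse])
    return np.stack([np.interp(ys, np.arange(gh), col) for col in rows.T], axis=1)


def elastic_deform_labels(labels, grid=3, sigma=10.0, rng=None):
    """Warp a label map with a smooth random displacement field."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = labels.shape
    # Coarse random displacements (in pixels), upsampled to a dense field.
    dy = upsample_bilinear(rng.normal(0.0, sigma, (grid, grid)), h, w)
    dx = upsample_bilinear(rng.normal(0.0, sigma, (grid, grid)), h, w)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(yy + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xx + dx), 0, w - 1).astype(int)
    # Nearest-neighbour lookup keeps the output a valid label map.
    return labels[src_y, src_x]


# Toy example: deform a square "cell" to obtain a simulated prediction.
gt = np.zeros((32, 32), dtype=int)
gt[8:24, 8:24] = 1
simulated = elastic_deform_labels(gt, rng=np.random.default_rng(0))
```

Varying `sigma` changes how far the simulated prediction drifts from the GT, which is one way to cover a wide range of IoU values in the training set.
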
Validation: The dataset was split into training, validation and test sets. Figure 2 displays some example images from the test set and the estimated SEG measures. The left panel of Figure 3 shows the scatter plot of the estimated versus the true SEG measure of the validation and test sets. The right panel of Figure 3 displays the Hit Rate as a function of IoU tolerance. AUC measures for test and validation are 0.871 and 0.895, respectively. MSE measures for test and validation are 0.796 and 0.848, respectively. Hit Rate of 0.18 was obtained for IoU tolerance of 18% (Hit Rate plot’s knee).

Figure 3: The left image shows the scatter plot of the validation and test images. The horizontal axis is the GT IoU of the instance and the vertical axis shows the QANet output. The diagonal line represents the optimal, desired output. On the right is a Hit Rate curve as a function of IoU prediction tolerance.

3.4 Network Architecture and Input Format

We compare three alternative architectures for the QANet: the RibCage Network, a Siamese Network and a classification network. In our experiments, all the layers have the same number of features and all networks end with three FC layers. The networks differ in the first four layers:
The RibCage network is as described above.
The Siamese network comprises two independent streams of four convolutional layers, one getting the image $I$ as input and the other the proposed segmentation $\Gamma$. The outputs of the last convolutional layers are concatenated and fed into the FC layers.
The classification network gets a single input, the concatenation on the channel axis of the grayscale image with the proposed segmentation image, $I$ and $\Gamma$. It comprises four convolutional layers followed by the FC layers.
We tested the three alternatives on the outputs of the CVUT-CZ and the KTH-SE(1) methods on two datasets: Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa.

RibCage Network (3-class) 0.904 0.947 0.912 0.944
RibCage Network (2-class) 0.902 0.933 0.908 0.917
Siamese Network 0.819 0.745 0.804 0.727
Classification Network 0.890 0.934 0.849 0.914
Table 1: The AUC scores for evaluating the segmentation predictions of the KTH-SE(1) and CVUT-CZ methods on the Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa datasets. The table compares three network architecture alternatives: RibCage Network (with binary or trinary segmentation input), Siamese Network and Classification Network. The RibCage Network with trinary segmentation inputs is consistently better than the two alternatives.

Figure 4 displays the Hit Rate curves, and Table 1 lists the AUC scores, obtained for the three architectures, where the RibCage was tested for both the 2-class and 3-class input formats. The results demonstrate the advantage of the RibCage network with 3-class segmentation.

Figure 4: The Hit Rate curves as a function of IoU prediction tolerance. On the left are the curves for the RibCage Network with either binary or trinary segmentation input. The right plot shows the comparison of the three network architectures, namely RibCage, Siamese and Classification networks. All configurations were tested using two segmentation methods, KTH-SE(1) and CVUT-CZ, on two datasets: Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa.

3.5 Evaluation of State-of-the-art Segmentation Methods

An alternative approach to the QANet could be the cross evaluation between multiple segmentation methods. For example, given two segmentation methods, one could act as a surrogate GT segmentation for the other, and vice versa. While this approach is valid, we show that it is less accurate than the QANet. The prediction capabilities of the QANet were tested on the outputs of BGU-IL(3), CVUT-CZ and KTH-SE(1). Each method was applied to the Fluo-N2DH-SIM+ test set. We note that the ground truth annotations for the test set are unavailable; however, the final scores, as validated by the challenge organizers, are published on the challenge website. We then measured the mean output of the QANet and the cross-method evaluation score. Table 2 shows the true and predicted SEG scores. As is demonstrated in the table, QANet’s predicted SEG scores for all tested segmentation methods are more accurate than the scores obtained by the surrogate GT.

| Evaluated Method | SEG Score | QANet Score    | Cross-method Score | Surrogate GT |
|------------------|-----------|----------------|--------------------|--------------|
| BGU-IL(3)        | 0.811     | 0.808 (-0.003) | 0.767 (-0.044)     | KTH-SE(1)    |
| CVUT-CZ          | 0.807     | 0.813 (+0.006) | 0.769 (-0.038)     | KTH-SE(1)    |
| KTH-SE(1)        | 0.791     | 0.799 (+0.007) | 0.772 (-0.019)     | CVUT-CZ      |

Table 2: Predicted SEG score results on the Cell Segmentation Benchmark SIM+ dataset for three leading segmentation methods of the challenge: BGU-IL(3), CVUT-CZ and KTH-SE(1). Prediction was done for the test data, for which the GT segmentations are unknown to us. The true SEG scores are taken from the Benchmark web page. The table presents the QANet estimate and the cross-method evaluation with respect to a surrogate GT. In brackets are the deviations from the true SEG score.
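
As a rough check of the comparison in Table 2, the mean absolute deviation of each estimate from the true SEG score can be recomputed from the rounded values in the table. Because the inputs are rounded to three decimals, a recomputed deviation may differ in the last digit from the bracketed value in the table; the variable and function names below are hypothetical.

```python
# (true SEG, QANet score, cross-method score) per evaluated method, from Table 2.
table2 = {
    "BGU-IL(3)": (0.811, 0.808, 0.767),
    "CVUT-CZ": (0.807, 0.813, 0.769),
    "KTH-SE(1)": (0.791, 0.799, 0.772),
}


def mean_abs_deviation(rows, column):
    """Mean absolute deviation of one score column from the true SEG score."""
    deviations = [abs(scores[column] - scores[0]) for scores in rows.values()]
    return sum(deviations) / len(deviations)


qanet_mad = mean_abs_deviation(table2, column=1)   # QANet estimate
cross_mad = mean_abs_deviation(table2, column=2)   # surrogate-GT estimate
```

On these rounded values the QANet estimate deviates by about 0.006 on average, against about 0.034 for the cross-method score.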

4 Summary

In this paper, we introduced the QANet - a method for scoring the accuracy of any instance segmentation method, at the single image level, without the need for ground truth annotations or manual inspection of the target data. The results presented in Section 3, based on the publicly available Cell Segmentation Benchmark datasets, show the QANet’s ability to generalize to different datasets and segmentation methods, while being trained only on simulated data.

The QANet does not in itself produce the segmentation of the image, but rather estimates the quality of a given segmentation. Though it is a supervised network, its output can be exploited to facilitate training on user-specific data in an unsupervised manner when ground truth annotations are not available. In the same fashion, it can be used to evaluate segmentation predictions in an on-line learning framework.

A possible alternative to the QANet would be to use two segmentation methods and use one to test the other. Such a comparison, presented in Section 3.5, demonstrates the superiority of the QANet. Our assumption is that regardless of the method - a classical or a machine learning one - segmentation processes are guided by similar principles; therefore, it is not unlikely that different segmentation methods will fail on similar examples and thus fail to evaluate each other. The QANet, on the other hand, has the advantage of being exposed to both the image and its segmentation - and its only task is grading. To paraphrase the British statesman Benjamin Disraeli, “…it is easier to be critical than to be correct”.