Computer-assisted diagnostic tools are becoming an important asset for a more effective cancer diagnosis. Such tools typically require capturing images, stain color normalization, segmentation of cells of interest, and classification to count malignant versus healthy cells. Despite the recent surge of self-supervised and unsupervised methods, the use of supervised approaches is still predominant for medical images. The main drawback of supervised methods is the need for quality labels, which are especially hard to obtain for medical image datasets because of the need for domain experts. Nevertheless, the compilation of labeled datasets will still be needed to foster advancements in specific fields, at least for evaluation purposes. The development of methods that can deal with noise and ambiguity in labels is thus crucial in order to leverage datasets generated with limited resources.
The development of automatic methods for cell segmentation in microscopy images has been studied for decades. However, only in recent years has the proliferation of deep learning led to state-of-the-art results in virtually every field within medical image analysis. Regarding segmentation specifically, U-Net is a fully convolutional architecture that outperformed existing methods in cell segmentation and tracking challenges. This architecture served as a basis for many others and was recently presented in a self-configuring form, which greatly simplifies hyperparameter tuning across different data modalities and surpassed highly specialized solutions on 23 public medical datasets. Similarly, Mask R-CNN is a general approach for instance segmentation that has been adapted to different domains, including cell nucleus segmentation.
Multiple Myeloma (MM) is a type of blood cancer, specifically, a plasma cell cancer. The first stage to build an automated diagnostic tool for MM is the robust segmentation of cells. This was the overall goal of the SegPC-2021 challenge, whose organizers provided images captured from the bone marrow aspirate slides of MM patients. The problem of segmenting plasma cells in these images is complex due to the diversity of situations that may occur. For instance, cells may appear in clusters or isolated, with varying size of their nucleus and cytoplasm, some of these touching each other and being hard to differentiate. Furthermore, the staining process is not perfect and there may be unstained cells or cells for which the cytoplasm’s color resembles that of the background. The problem is challenging per se, but the use of computer-generated imperfect labels adds complexity to it because of the associated noise and inconsistencies.
In this paper, we present a method for instance segmentation of MM plasma cells in microscopy images that were labeled using a semi-automatic procedure resulting in noisy annotations. Many works have proposed ad-hoc solutions to deal with such labels. In our work, we rely on image augmentation and model ensembles to cope with the imperfect labels of SegPC-2021, which resulted in segmentation results that are arguably better than the actual ground-truth (GT) labels and, eventually, in the winning entry for the SegPC-2021 challenge.
2 SegPC-2021 Challenge Dataset
The SegPC-2021 dataset consisted of images captured from slides of bone marrow aspirates collected from patients suffering from MM. The competition organizers used stain color normalization to preprocess the images. Images were annotated in two stages. First, the cells of interest were identified and marked by an expert oncologist. Then, the nuclei of those cells were segmented using a previously published nucleus segmentation method. The expert annotations were used to automatically select the relevant segmented nuclei and discard the rest, and their cytoplasm was segmented using MATLAB's Local Graph Cut tool. This tool automatically produces segmentation masks for a region of interest that can be later refined by the user. Even though the generated instance masks corresponded only to the cells of interest identified by an oncologist, the quality of those masks was not evaluated. Some examples of original images labeled by the expert and the ground truth masks provided for the competition are included in Figure 1.
The use of the aforementioned semi-automated procedure to label the images in the competition dataset led to noisy and inconsistent labels. Examples of this can be seen in Figure 2. Overall, ground truth masks are characterized by spiky contours that do not reflect the true aspect of the cell. In some cases, cytoplasm regions whose color is close to that of the background are partially and sometimes completely omitted. Although less common, nucleus masks also include noise in some images (e.g. last column in Fig. 2). These problems require a robust method that can learn to segment cells consistently, despite the high variability seen in their morphological and color characteristics.
The competition training data for the final phase consisted of 498 images, which were additionally split into train/validation sets (20% validation set). The final competition test set, on which all of the competitors were evaluated, consisted of 277 images. The only data used in our experiments was provided as part of the SegPC-2021 competition. No additional external data was used.
Our approach is based on instance segmentation methods. The use of semantic segmentation models would require additional post-processing to split the resulting image-level masks into different instances. We decided to avoid this and tackle the problem directly as an instance segmentation problem for simplicity.
3.1 Instance Segmentation
Three existing instance segmentation models have been used: Mask R-CNN, Hybrid Task Cascade (HTC), and SCNet. An in-depth analysis of these architectures is out of the scope of this article, so we kindly refer the readers to the corresponding papers. Nonetheless, as a high-level explanation, note that Mask R-CNN serves as a baseline approach both for HTC and SCNet. HTC combines Mask R-CNN and Cascade R-CNN in a novel way by interweaving the detection and segmentation tasks and including an additional semantic branch to provide spatial context. Based on HTC, the recent SCNet combines all the mask prediction stages in a single stage and moves it to the end to improve sample consistency between training and inference. Additionally, it includes a custom feature relay stage and a global context branch. All the aforementioned methods require a convolutional backbone to act as a feature extractor. In our work, we have explored the use of ResNet and the more recent ResNeSt.
In the SegPC-2021 challenge, pixels belong to one of two classes, apart from the background: nucleus or cytoplasm. Unlike other classical instance segmentation datasets that include many different types of scenes, in this case, all images depict the same type of content and the two classes are related spatially. In general, a cell nucleus appears surrounded by cytoplasm, composing a whole cell.
When carrying out instance segmentation, results for different classes are given separately. This poses the problem of accurately pairing corresponding nucleus and cytoplasm instances. To ease this process, we decided to add a third class that comprises the whole cell, both nucleus and cytoplasm. This class is taken as the main segmentation result and is later combined with the other two to produce the final two-class result. The semantic branch of both HTC and SCNet models, whose purpose is to provide global context, was fed during training with binary masks for this whole-cell class including all the instances in an image.
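The construction of the whole-cell class can be illustrated with a minimal NumPy sketch. It assumes each instance is given as a boolean mask of the image size; the function names are hypothetical, and the recombination step (`split_cell`) is only one plausible way of deriving the final two-class result, not necessarily the exact procedure used for the competition.

```python
import numpy as np


def whole_cell_mask(nucleus, cytoplasm):
    """Third class: union of a nucleus mask and its cytoplasm mask."""
    return np.logical_or(nucleus, cytoplasm)


def semantic_target(whole_cell_masks):
    """Image-level binary mask covering every whole-cell instance,
    in the form a semantic branch could be trained on."""
    return np.any(np.stack(whole_cell_masks), axis=0)


def split_cell(whole_cell, nucleus):
    """Assumed recombination: the nucleus is clipped to the predicted cell
    and the cytoplasm is whatever remains of the cell."""
    nucleus_in_cell = np.logical_and(whole_cell, nucleus)
    cytoplasm = np.logical_and(whole_cell, np.logical_not(nucleus_in_cell))
    return nucleus_in_cell, cytoplasm
```

Treating the whole cell as the primary prediction means the nucleus and cytoplasm of one cell never have to be paired after the fact, since both are derived from the same cell mask.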
3.2 Image Augmentation
Due to the limited amount of training data, heavy image augmentation has been carried out. We generated 50 additional augmented images for each image in the training set, adding up to 20298 training images. Several different kinds of augmentations were combined: geometric (scale, flip, rotate, elastic transform, piece-wise affine transform, perspective transform), color (brightness, color to gray, modification of hue and saturation), contrast (CLAHE, gamma contrast, linear contrast), blur (Gaussian, median, motion), convolutional (sharpen, emboss), and Gaussian noise corruption. Augmentations were generated using the imgaug library (https://github.com/aleju/imgaug). Test-time augmentation was also used but limited to random flips (vertical, horizontal, and diagonal).
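As an illustration of flip-based test-time augmentation, the sketch below averages the outputs of a model over flipped views of an image. It is a simplification under stated assumptions: `predict_fn` is a hypothetical callable returning a dense per-pixel score map (instance segmentation frameworks merge detections differently), all flips are applied deterministically rather than at random, and "diagonal" is taken to mean flipping along both image axes at once.

```python
import numpy as np

# Axes flipped for each direction; "diagonal" is assumed to flip both axes.
FLIP_AXES = {
    "none": (),
    "horizontal": (1,),
    "vertical": (0,),
    "diagonal": (0, 1),
}


def flip(array, direction):
    """Flip an H x W (x C) array along the axes of the given direction."""
    axes = FLIP_AXES[direction]
    return np.flip(array, axis=axes) if axes else array


def predict_with_tta(predict_fn, image,
                     directions=("none", "horizontal", "vertical", "diagonal")):
    """Average the score maps predicted for several flipped views."""
    score_maps = []
    for direction in directions:
        scores = predict_fn(flip(image, direction))
        # A flip is its own inverse, so flipping again restores alignment.
        score_maps.append(flip(scores, direction))
    return np.mean(score_maps, axis=0)
```

For a predictor that is perfectly flip-equivariant the average equals the single-view prediction; in practice the views disagree slightly and averaging smooths those differences.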
3.3 Model Ensemble
In order to provide more robust segmentation masks, a custom strategy to combine the results of different models was designed taking into account the particularities of the dataset and evaluation method of the SegPC-2021 challenge, which are described in Sections 2 and 4.1, respectively.
Out of a set of several trained models, one is selected as the reference model. To refine its predictions, majority voting is carried out for each cell it segmented, using one prediction from each of the remaining models. For each of these models, the instance with the largest IoU with respect to the reference instance is selected. A minimum IoU is required for a prediction to participate in the voting. All the predictions that were not used in the majority voting procedure were also included in the submission to the competition.
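The voting procedure can be sketched as follows. This is an illustrative reading of the description rather than the competition code: the function names and the `min_iou` threshold of 0.5 are hypothetical, the reference prediction is assumed to cast a vote itself, and the vote is assumed to be a pixel-wise strict majority.

```python
import numpy as np


def iou(a, b):
    """Intersection-over-Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def refine_by_voting(reference_masks, other_models_masks, min_iou=0.5):
    """Refine each reference instance by pixel-wise majority voting.

    reference_masks: list of boolean masks from the reference model.
    other_models_masks: one list of boolean masks per remaining model.
    Returns the refined masks and the predictions that never voted.
    """
    refined = []
    used = [set() for _ in other_models_masks]
    for ref in reference_masks:
        voters = [ref]  # assumption: the reference model also votes
        for m, masks in enumerate(other_models_masks):
            if not masks:
                continue
            scores = [iou(ref, mask) for mask in masks]
            best = int(np.argmax(scores))
            if scores[best] >= min_iou:  # too-distant matches do not vote
                voters.append(masks[best])
                used[m].add(best)
        votes = np.sum(voters, axis=0)
        refined.append(votes * 2 > len(voters))  # strict majority per pixel
    leftovers = [mask
                 for m, masks in enumerate(other_models_masks)
                 for i, mask in enumerate(masks)
                 if i not in used[m]]
    return refined, leftovers
```

Returning the unused predictions alongside the refined ones mirrors the choice of submitting them as well, which is harmless under a metric that does not penalize unmatched predictions.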
4 Experiments and Results
The metric used for the evaluation of the models in the SegPC-2021 challenge is the mean Intersection-over-Union (mIoU). Specifically, for each ground truth cell instance in an image, the candidate instance with the highest IoU among all the predictions for that image is selected. The mIoU is the result of accumulating the IoU of all the selected predictions and dividing it by the total number of ground truth instances. Predicted cells that are not selected as the best match for any ground truth cell do not penalize the final score in any way.
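A minimal sketch of this metric, as described above, is given next. It is not the official evaluation script: the function names and the input layout (one pair of ground truth and predicted boolean masks lists per image) are assumptions made for illustration.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def mean_iou(images):
    """mIoU over a dataset.

    images: iterable of (gt_masks, pred_masks) pairs, one pair per image,
    where each element is a list of boolean masks.
    """
    total_iou = 0.0
    n_gt = 0
    for gt_masks, pred_masks in images:
        for gt in gt_masks:
            # Best-matching prediction for this ground truth instance;
            # an image without predictions contributes an IoU of 0.
            total_iou += max((mask_iou(gt, p) for p in pred_masks),
                             default=0.0)
            n_gt += 1
    if n_gt == 0:
        raise ValueError("no ground truth instances to evaluate")
    return total_iou / n_gt
```

Because only the best match per ground truth instance is counted, adding extra predictions can never lower the score, which is the property exploited when unused ensemble predictions are kept in the submission.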
To illustrate the performance of our method when dealing with noisy GT labels, Figure 3 shows the four worst predictions made by the final model ensemble on our validation set. Note that each prediction shown is the best match found (in terms of IoU) for the GT label. In all four cases, our predictions arguably fit the cells better than the provided GT masks. In the first example, for which our best match only reached an IoU of 0.6376 with the GT, the issue is that two touching nuclei were mislabeled as a single cell. In the second example, with an IoU of 0.6752, the GT label misses part of the cytoplasm, while for the third example, in which it is hard to tell whether the labeled cytoplasm corresponds entirely to that nucleus, our prediction achieves an IoU of 0.7606. Finally, in the last example the GT seems to leave out part of the cytoplasm at the bottom, while our prediction (IoU = 0.7734) captures it adequately.
The accompanying figure shows the predictions of our final ensemble for two sample images. Predictions with the highest IoU with respect to the ground truth instance masks were selected, similarly to what is done by the competition evaluation script. Therefore, not all our predictions for an image are included here. In the first row, we observe that all cells of interest were detected very accurately. It is worth noting that the spiky contours of the ground truth labels were not replicated in the predicted masks. Overall, predicted cells present smoother contours which, in our opinion, better reflect the actual shape of the cells as perceived in the original images. The second example shows that our model succeeded in correcting noisy nucleus labels, producing masks that better match the nuclei in the original image.
The table below reports results for the models we trained for the SegPC-2021 competition. The best-performing epoch in terms of mean Average Precision (mAP) on our validation set was selected for each of the models. This was a key aspect, since considerable variations could be observed along the training process, most likely indicating over-fitting to the limited training data.
|Dataset||Our validation set||Competition final test set|
For every backbone we used, the Hybrid Task Cascade (HTC) architecture consistently outperformed Mask R-CNN, and SCNet in turn outperformed both of the other architectures. The best-performing single model was SCNet with a ResNet-50 backbone. When comparing backbone depths, we observe that deeper backbones perform better for the simpler Mask R-CNN but, overall, HTC and SCNet generate better predictions when using shallower 50-layer backbones.
In this article, we have presented in detail the winning solution for the SegPC-2021 competition. The semi-automatic labeling method used produced valuable but imperfect labels. Relying on heavy data augmentation and a custom procedure to aggregate results from different models based on majority voting, robust instance segmentation results were generated. The combination of state-of-the-art convolutional backbones and instance segmentation architectures led to a variety of single models that already showed remarkable performance. The predictions of seven of these models were aggregated and the resulting model ensemble outperformed the solutions of the rest of the participants with an mIoU of 0.9389.
Acknowledgements. This work was partially supported by the European Commission through the Horizon 2020 research and innovation program under grant 826121 (iPC).
-  (2018) Cascade r-cnn: delving into high quality object detection. In , pp. 6154–6162. Cited by: §3.1.
-  (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §3.1, §4.2.
-  (2020) EDNFC-net: convolutional neural network with nested feature concatenation for nuclei-instance segmentation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1389–1393. Cited by: §2.
-  (2020) GCTI-sn: geometry-inspired chemical and tissue invariant stain normalization of microscopic medical images. Medical Image Analysis 65, pp. 101788. Cited by: §2.
-  Cited by: §2.
-  (2017) Mask r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2961–2969. Cited by: §1, §3.1, §4.2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
-  (2021) NnU-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18 (2), pp. 203–211. Cited by: §1.
-  (2019) Automatic nucleus segmentation with mask-rcnn. In Science and Information Conference, pp. 399–407. Cited by: §1.
-  (2020) Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Medical Image Analysis 65, pp. 101759. Cited by: §1.
-  (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §1.
-  (2012) Cell segmentation: 50 years down the road [life sciences]. IEEE Signal Processing Magazine 29 (5), pp. 140–145. Cited by: §1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
-  (2004) ”GrabCut” interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG) 23 (3), pp. 309–314. Cited by: §2.
-  SCNet: training inference sample consistency for instance segmentation. In AAAI Conference on Artificial Intelligence. Cited by: §3.1, §4.2.
-  (2020) Resnest: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: §3.1.