Segmentation of anatomical structures in medical images is known to suffer from high inter-reader variability Lazarus et al. (2006); Watadani et al. (2013); Rosenkrantz et al. (2013); Menze et al. (2014); Joskowicz et al. (2019), influencing the performance of downstream supervised machine learning models. This problem is particularly prominent in the medical domain, where labelled data are commonly scarce due to the high cost of annotations. For instance, accurate identification of multiple sclerosis (MS) lesions in MRIs is difficult even for experienced experts due to variability in lesion location, size, shape and anatomy across patients Zhang et al. (2019). As another example, Menze et al. (2014) report average inter-reader variability in the range 74–85% for glioblastoma (a type of brain tumour) segmentation. Further aggravated by differences in biases and levels of expertise, segmentation annotations of structures in medical images suffer from high annotation variation Kats et al. (2019). Consequently, despite the present abundance of medical imaging data thanks to over two decades of digitisation, the world still remains relatively short of data with curated labels Harvey and Glocker (2019) that are amenable to machine learning, necessitating intelligent methods for learning robustly from such noisy annotations.
To mitigate inter-reader variations, different pre-processing techniques are commonly used to curate segmentation annotations by fusing labels from different experts. The most basic yet popular approach is based on the majority vote, where the most representative opinion of the experts is treated as the ground truth (GT). A smarter version that accounts for the similarity of classes has proven effective in the aggregation of brain tumour segmentation labels Menze et al. (2014). A key limitation of such approaches, however, is that all experts are assumed to be equally reliable. Warfield et al. (2004) proposed a label fusion method, called STAPLE, that explicitly models the reliability of individual experts and uses that information to "weigh" their opinions in the label aggregation step. After consistent demonstration of its superiority over the standard majority-vote pre-processing in multiple applications, STAPLE has become the go-to label fusion method in the creation of public medical image segmentation datasets, e.g., the ISLES Winzeck et al. (2018), MSSeg Commowick et al. (2018) and Gleason'19 datasets. Asman et al. later extended this approach in Asman and Landman (2011) by accounting for voxel-wise consensus to address the issue of under-estimation of annotators' reliability. In Asman and Landman (2012), another extension was proposed in order to model the reliability of annotators across different pixels in images. More recently, within the context of multi-atlas segmentation problems Iglesias et al. (2013), where image registration is used to warp segments from labelled images ("atlases") onto a new scan, STAPLE has been enhanced in multiple ways to encode the information of the underlying images into the label aggregation process. A notable example is STEPS, proposed in Cardoso et al. (2013), who designed a strategy to further incorporate the local morphological similarity between atlases and target images; different extensions of this approach, such as Asman and Landman (2013); Akhondi-Asl et al. (2014), have since been considered. However, these previous label fusion approaches have a common drawback: they critically lack a mechanism to integrate information across different training images. This fundamentally limits the remit of applications to cases where each image comes with a reasonable number of annotations from multiple experts, which can be prohibitively expensive in practice. Moreover, relatively simplistic functions are used to model the relationship between the observed noisy annotations, the true labels and the reliability of experts, which may fail to capture the complex characteristics of human annotators.
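A minimal sketch of the majority-vote baseline described above (illustrative only; this is not the STAPLE algorithm, which additionally estimates per-annotator reliability):

```python
import numpy as np

def majority_vote(labels):
    """Fuse per-pixel class labels from several annotators by majority vote.

    labels: int array of shape (R, H, W) -- one label map per annotator.
    Returns an (H, W) map holding the most frequent class at each pixel
    (ties resolved in favour of the lowest class index).
    """
    num_classes = labels.max() + 1
    # Per-pixel vote counts, shape (num_classes, H, W).
    votes = np.stack([(labels == c).sum(axis=0) for c in range(num_classes)])
    return votes.argmax(axis=0)

# Three annotators disagree on a toy 2x2 binary mask.
ann = np.array([[[0, 1], [1, 1]],
                [[0, 1], [0, 1]],
                [[1, 1], [0, 0]]])
fused = majority_vote(ann)  # -> [[0, 1], [0, 1]]
```

Note that every annotator's vote carries equal weight here, which is precisely the limitation that STAPLE-style reliability modelling addresses.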
In this work, we introduce the first instance of an end-to-end supervised segmentation method that jointly estimates, from noisy labels alone, the reliability of multiple human annotators and the true segmentation labels. The proposed architecture (Fig. 1) consists of two coupled CNNs, where one estimates the true segmentation probabilities and the other models the characteristics of individual annotators (e.g., tendency to over-segment, mix-ups between different classes, etc.) by estimating the pixel-wise confusion matrices (CMs) on a per-image basis. Unlike STAPLE Warfield et al. (2004) and its variants, our method models, and disentangles with deep neural networks, the complex mappings from the input images to the annotator behaviours and to the true segmentation label. Furthermore, the parameters of the CNNs are "global variables" that are optimised across different image samples; this enables the model to disentangle robustly the annotators' mistakes and the true labels based on correlations between similar image samples, even when the number of available annotations per image is small (e.g., a single annotation per image). In contrast, this would not be possible with STAPLE Warfield et al. (2004) and its variants Asman and Landman (2012); Cardoso et al. (2013), where the annotators' parameters are estimated on every target image separately.
For evaluation, we first simulate a diverse range of annotator types on the MNIST dataset by performing morphometric operations with the Morpho-MNIST framework Castro et al. (2019). We then demonstrate its potential on several real-world medical imaging datasets, namely (i) the MS lesion segmentation dataset (MSLSC) from the ISBI 2015 challenge Styner et al. (2008), (ii) the brain tumour segmentation dataset (BraTS) Menze et al. (2014) and (iii) the lung nodule segmentation dataset (LIDC-IDRI) Armato III et al. (2011). Experiments on all datasets demonstrate that our method consistently leads to better segmentation performance compared to widely adopted label-fusion methods and other relevant baselines, especially when the number of available labels per image is low and the degree of annotator disagreement is high.
2 Related Work
The majority of algorithmic innovations in the space of label aggregation for segmentation have originated from the medical imaging community, partly due to the prominence of the inter-reader variability problem in the field and the wide-reaching value of reliable segmentation methods Asman and Landman (2012). The aforementioned methods based on the STAPLE framework, such as Warfield et al. (2004); Asman and Landman (2011, 2012); Cardoso et al. (2013); Weisenfeld and Warfield (2011); Asman and Landman (2013); Akhondi-Asl et al. (2014); Joskowicz et al. (2018), are based on generative models of human behaviours, where the latent variables of interest are the unobserved true labels and the "reliability" of the respective annotators. Our method can be viewed as a translation of the STAPLE framework to the supervised learning paradigm. As such, our method produces a model that can segment test images without needing to acquire labels from annotators or atlases, unlike STAPLE and its local variants. Another key difference is that our method is jointly trained on many different subjects, while the STAPLE variants are only fitted on a per-subject basis. This means that our method is able to learn from correlations between different subjects, which previous works have not attempted; for example, our method uniquely can estimate the reliability and true labels even when there is only one label available per input image, as shown later.
Our work also relates to a recent strand of methods that aim to generate a set of diverse and plausible segmentation proposals for a given image. Notably, the probabilistic U-net Kohl et al. (2018) and its recent variant PHiSeg Baumgartner et al. (2019) have shown that the aforementioned inter-reader variations in segmentation labels can be modelled with sophisticated forms of probabilistic CNNs. Such approaches, however, fundamentally differ from ours in that the variable annotations from many experts in the training data are all assumed to be realistic instances of the true segmentation; we assume, on the other hand, that there is a single, unknown, true segmentation map of the underlying anatomy, and that each individual annotator produces a noisy approximation to it, with variations that reflect their individual characteristics. The latter assumption may be reasonable in the context of segmentation problems, since there exists only one true boundary of the physical objects captured in an image, while multiple hypotheses can arise from ambiguities in human interpretation.
We also note that, in standard classification problems, a plethora of works have shown the utility of modelling the labelling process of human annotators in restoring the true label distribution Raykar et al. (2010); Khetan et al. (2017); Tanno et al. (2019). Such approaches can be categorised into two groups: (1) two-stage approaches Dawid and Skene (1979); Smyth et al. (1995); Whitehill et al. (2009); Welinder et al. (2010); Rodrigues et al. (2013), and (2) simultaneous approaches Raykar et al. (2009); Yan et al. (2010); Branson et al. (2017); Van Horn et al. (2018); Khetan et al. (2017); Tanno et al. (2019). In the first category, the noisy labels are first curated through a probabilistic model of the annotators, and subsequently a supervised machine-learning model is trained on the curated labels. The initial attempt, Dawid and Skene (1979), was made in the late 1970s, and numerous advances such as Smyth et al. (1995); Whitehill et al. (2009); Welinder et al. (2010); Rodrigues et al. (2013) have since built upon this work, e.g., by estimating sample difficulty and human biases. In contrast, models in the second category aim to curate labels and learn a supervised model jointly in an end-to-end fashion Raykar et al. (2009); Yan et al. (2010); Branson et al. (2017); Van Horn et al. (2018); Khetan et al. (2017); Tanno et al. (2019), so that the two components inform each other. Although the evidence remains limited to simple classification tasks, these simultaneous approaches have shown promising improvements over the methods in the first category in terms of the predictive performance of the supervised model and sample efficiency (i.e., fewer labels are required per input). However, to date very little attention has been paid to the same problem in more complicated, structured prediction tasks where the outputs are high-dimensional.
In this work, we propose the first simultaneous approach to addressing such a problem for image segmentation, while drawing inspiration from the STAPLE framework Warfield et al. (2004), which would fall into the two-stage category.
3.1 Problem Set-up
In this work, we consider the problem of learning a supervised segmentation model from noisy labels acquired from multiple human annotators. Specifically, we consider a scenario where a set of images $\{\mathbf{x}_n \in \mathbb{R}^{W \times H \times C}\}_{n=1}^{N}$ (with $W, H, C$ denoting the width, height and channels of the image) are assigned with noisy segmentation labels $\{\tilde{\mathbf{y}}_n^{(r)} \in \mathcal{Y}^{W \times H}\}_{r \in S(\mathbf{x}_n)}$ from multiple annotators, where $\tilde{\mathbf{y}}_n^{(r)}$ denotes the label from annotator $r$, $S(\mathbf{x}_n)$ denotes the set of all annotators who labelled image $\mathbf{x}_n$, and $\mathcal{Y} = \{1, \ldots, L\}$ denotes the set of classes.
Here we assume that every image is annotated by at least one person, i.e., $|S(\mathbf{x}_n)| \geq 1$, and no GT labels are available. The problem of interest is to learn the unobserved true segmentation distribution $p(\mathbf{y} \,|\, \mathbf{x})$ from such a noisy labelled dataset, i.e., the combination of images, noisy annotations and experts' identities for labels (which label was obtained from whom).
We also emphasise that the goal at inference time is to segment a given unlabelled test image but not to fuse multiple available labels as is typically done in multi-atlas segmentation approaches Iglesias et al. (2013).
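The structure of such a dataset, with annotator identities attached to each label, might be organised as in the following sketch (hypothetical container; names and shapes are illustrative):

```python
import numpy as np

# Hypothetical container for the noisy-label dataset described above:
# each image is paired with the masks of whichever annotators labelled it,
# keyed by annotator id, so the model knows "which label came from whom".
dataset = [
    {
        "image": np.zeros((128, 128, 1), dtype=np.float32),  # x_n, (W, H, C)
        "labels": {                                          # S(x_n) = {0, 2}
            0: np.zeros((128, 128), dtype=np.int64),         # annotator 0's mask
            2: np.zeros((128, 128), dtype=np.int64),         # annotator 2's mask
        },
    },
    # ... more images; every image carries at least one entry in "labels"
]

# The problem set-up's assumption |S(x_n)| >= 1:
annotators_per_image = [set(d["labels"]) for d in dataset]
assert all(len(s) >= 1 for s in annotators_per_image)
```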
3.2 Probabilistic Model and Proposed Architecture
Here we describe the probabilistic model of the observed noisy labels from multiple annotators. We make two key assumptions: (1) annotators are statistically independent; (2) annotations over different pixels are independent given the input image. Under these assumptions, the probability of observing noisy labels $\{\tilde{\mathbf{y}}^{(r)}\}_{r \in S(\mathbf{x})}$ on $\mathbf{x}$ factorises as:

$$p\big(\{\tilde{\mathbf{y}}^{(r)}\}_{r \in S(\mathbf{x})} \,\big|\, \mathbf{x}\big) = \prod_{r \in S(\mathbf{x})} \prod_{w=1}^{W} \prod_{h=1}^{H} p\big(\tilde{y}_{wh}^{(r)} \,\big|\, \mathbf{x}\big) \tag{1}$$

where $\tilde{y}_{wh}^{(r)} \in \mathcal{Y}$ denotes the $(w,h)$ element of $\tilde{\mathbf{y}}^{(r)}$. Now we rewrite the probability of observing each noisy label on each pixel as:

$$p\big(\tilde{y}_{wh}^{(r)} = i \,\big|\, \mathbf{x}\big) = \sum_{j=1}^{L} p\big(\tilde{y}_{wh}^{(r)} = i \,\big|\, y_{wh} = j, \mathbf{x}\big)\, p\big(y_{wh} = j \,\big|\, \mathbf{x}\big) \tag{2}$$

where $p(y_{wh} = j \,|\, \mathbf{x})$ denotes the GT label distribution over the $(w,h)$ pixel in the image $\mathbf{x}$, and $p(\tilde{y}_{wh}^{(r)} = i \,|\, y_{wh} = j, \mathbf{x})$ describes the noisy labelling process by which annotator $r$ corrupts the true segmentation label. In particular, we refer to the $L \times L$ matrix $A^{(r)}(\mathbf{x}, w, h)$, whose $(i,j)$ element is defined by the latter term, as the CM of annotator $r$ at pixel $(w,h)$ in image $\mathbf{x}$.
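Numerically, Eq. (2) says that the marginal distribution of an annotator's noisy label at a pixel is the pixel's CM applied to the true label distribution. A tiny example with hypothetical numbers:

```python
import numpy as np

# True per-pixel label distribution over L = 3 classes (here, certain: class 1).
p_true = np.array([0.0, 1.0, 0.0])

# Hypothetical CM for one annotator at this pixel:
# A[i, j] = p(noisy label = i | true label = j, x); each column sums to 1.
A = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.1],
              [0.0, 0.1, 0.9]])

# Marginal distribution of this annotator's noisy label at the pixel (Eq. 2).
p_noisy = A @ p_true  # -> [0.2, 0.7, 0.1]
```

With the true class being 1, the annotator mislabels it as class 0 with probability 0.2 and as class 2 with probability 0.1, reading off the middle column of the CM.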
We introduce a CNN-based architecture which models the different constituents in the above joint probability distribution, as illustrated in Fig. 1. The model consists of two components: (1) the Segmentation Network, parametrised by $\theta$, which estimates the GT segmentation probability map $\hat{\mathbf{p}}_\theta(\mathbf{x}) \in [0,1]^{W \times H \times L}$, whose each element approximates $p(y_{wh} = j \,|\, \mathbf{x})$; (2) the Annotator Network, parametrised by $\phi$, which generates estimates of the pixel-wise CMs of the respective annotators as a function of the input image, $\hat{A}_\phi^{(r)}(\mathbf{x}) \in [0,1]^{W \times H \times L \times L}$, whose each element approximates $p(\tilde{y}_{wh}^{(r)} = i \,|\, y_{wh} = j, \mathbf{x})$. Each product $\hat{A}_\phi^{(r)}(\mathbf{x}) \circledast \hat{\mathbf{p}}_\theta(\mathbf{x})$ represents the estimated segmentation probability map of the corresponding annotator. Note that here "$\circledast$" denotes the element-wise matrix multiplication in the spatial dimensions $W, H$. At inference time, we use the output of the segmentation network $\hat{\mathbf{p}}_\theta(\mathbf{x})$ to segment test images.
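The per-pixel product of the two network outputs can be sketched as follows (PyTorch, toy shapes; an illustration of the operation, not the authors' exact implementation):

```python
import torch

W, H, L = 4, 4, 3  # toy spatial size and number of classes

# Output of the segmentation network: per-pixel class probabilities.
p_hat = torch.softmax(torch.randn(W, H, L), dim=-1)        # (W, H, L)

# Output of the annotator network for one annotator: a pixel-wise CM whose
# columns are probability distributions (softmax over the first CM axis).
cm_logits = torch.randn(W, H, L, L)
A_hat = torch.softmax(cm_logits, dim=-2)                   # (W, H, L, L)

# Element-wise (per-pixel) matrix-vector product A_hat "*" p_hat over the
# spatial grid: the annotator's predicted noisy label distribution.
p_annotator = torch.einsum('whij,whj->whi', A_hat, p_hat)  # (W, H, L)
```

Since each CM column and each `p_hat` pixel distribution sum to one, every pixel of `p_annotator` is again a valid probability distribution over the classes.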
We note that each spatial CM, $\hat{A}_\phi^{(r)}(\mathbf{x})$, contains $WHL^2$ variables, and calculating the corresponding annotator's prediction requires $\mathcal{O}(WHL^2)$ floating-point operations, potentially incurring a large time/space cost when the number of classes is large. Although this is not the focus of this work (as we are concerned with medical imaging applications for which the number of classes is mostly limited to fewer than 10), we also consider a low-rank approximation scheme to alleviate this issue wherever appropriate. More details are provided in the supplementary material.
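To make the cost argument concrete, the following compares the per-image variable count of the full pixel-wise CMs with a hypothetical rank-1 parameterisation (two $L$-vectors plus a scale per pixel); the exact low-rank scheme used here is described in the supplementary material:

```python
# Per-image variable counts for pixel-wise confusion matrices.
# full: one dense L x L matrix per pixel, i.e. W*H*L^2 variables.
# rank-1 (hypothetical): per pixel, vectors u, v in R^L and one scale,
# giving W*H*(2L + 1) variables -- linear rather than quadratic in L.
W, H, L = 192, 192, 10

full_params = W * H * L * L          # 3,686,400
rank1_params = W * H * (2 * L + 1)   # 774,144

print(full_params, rank1_params)
```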
3.3 Learning Spatial Confusion Matrices and True Segmentation
Next, we describe how we jointly optimise the parameters of the segmentation network, $\theta$, and the parameters of the annotator network, $\phi$. In short, we minimise the negative log-likelihood of the probabilistic model plus a regularisation term via stochastic gradient descent. A detailed description is provided below.
Given training inputs $\{\mathbf{x}_n\}_{n=1}^{N}$ and noisy labels $\{\tilde{\mathbf{y}}_n^{(r)} : r \in S(\mathbf{x}_n)\}$ for $n = 1, \ldots, N$, we optimise the parameters $\{\theta, \phi\}$ by minimising the negative log-likelihood (NLL), $-\sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \log p(\tilde{\mathbf{y}}_n^{(r)} \,|\, \mathbf{x}_n)$. From eqs. (1) and (2), this optimisation objective equates to the sum of cross-entropy losses between the observed noisy segmentations and the estimated annotator label distributions:

$$-\sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \log p\big(\tilde{\mathbf{y}}_n^{(r)} \,\big|\, \mathbf{x}_n\big) = \sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \mathrm{CE}\big(\hat{A}_\phi^{(r)}(\mathbf{x}_n) \circledast \hat{\mathbf{p}}_\theta(\mathbf{x}_n),\, \tilde{\mathbf{y}}_n^{(r)}\big) \tag{3}$$
Minimising the above encourages each annotator-specific prediction $\hat{A}_\phi^{(r)}(\mathbf{x}) \circledast \hat{\mathbf{p}}_\theta(\mathbf{x})$ to be as close as possible to the true noisy label distribution of the annotator, $p(\tilde{\mathbf{y}}^{(r)} \,|\, \mathbf{x})$. However, this loss function alone is not capable of separating the annotation noise from the true label distribution; there are many combinations of CM estimates $\{\hat{A}_\phi^{(r)}(\mathbf{x})\}$ and segmentation model $\hat{\mathbf{p}}_\theta(\mathbf{x})$ whose product perfectly matches the true annotator's distribution for any input $\mathbf{x}$ (e.g., permutations of rows in the CMs). To combat this problem, inspired by Tanno et al. (2019), which addressed an analogous issue for the classification task, we add the trace of the estimated CMs to the loss function in Eq. (3) as a regularisation term (see Sec. 3.4). We thus optimise the combined loss:

$$\mathcal{L}(\theta, \phi) := \sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \Big[ \mathrm{CE}\big(\hat{A}_\phi^{(r)}(\mathbf{x}_n) \circledast \hat{\mathbf{p}}_\theta(\mathbf{x}_n),\, \tilde{\mathbf{y}}_n^{(r)}\big) + \lambda \cdot \mathrm{tr}\big(\hat{A}_\phi^{(r)}(\mathbf{x}_n)\big) \Big] \tag{4}$$

where $S(\mathbf{x}_n)$ denotes the set of all annotators who labelled image $\mathbf{x}_n$, $\mathrm{tr}(A)$ denotes the trace of matrix $A$ (averaged over pixels), and $\lambda$ is a hyperparameter controlling the strength of the regularisation. The mean trace represents the average probability that a randomly selected annotator provides an accurate label. Intuitively, minimising the trace encourages the estimated annotators to be maximally unreliable, while minimising the cross-entropy ensures fidelity with the observed noisy annotations. We minimise this combined loss via stochastic gradient descent to learn both $\{\theta, \phi\}$.
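A sketch of the combined loss in PyTorch (assumed tensor layouts and the trace weight `alpha` are illustrative; the exact implementation details are the authors'):

```python
import torch
import torch.nn.functional as F

def noisy_label_loss(p_hat, cms, noisy_labels, alpha=0.1):
    """Cross-entropy to each annotator's noisy labels plus mean-trace
    regularisation on the estimated CMs (a sketch of Eq. (4)).

    p_hat:        (B, L, H, W) estimated true-label probabilities
    cms:          dict annotator_id -> (B, L, L, H, W) estimated pixel CMs
    noisy_labels: dict annotator_id -> (B, H, W) integer label maps
    """
    ce, trace = 0.0, 0.0
    for r, y in noisy_labels.items():
        A = cms[r]                                      # (B, L, L, H, W)
        # Per-pixel matrix-vector product: annotator r's predicted noisy dist.
        p_r = torch.einsum('bijhw,bjhw->bihw', A, p_hat)
        ce = ce + F.nll_loss(torch.log(p_r + 1e-8), y)
        # Mean trace of the pixel-wise CMs over pixels and batch.
        trace = trace + A.diagonal(dim1=1, dim2=2).mean()
    return ce + alpha * trace

# Toy usage: batch 1, L = 2 classes, a 4 x 4 image, one annotator.
p_hat = torch.softmax(torch.randn(1, 2, 4, 4), dim=1)
cms = {0: torch.softmax(torch.randn(1, 2, 2, 4, 4), dim=1)}  # columns sum to 1
labels = {0: torch.zeros(1, 4, 4, dtype=torch.long)}
loss = noisy_label_loss(p_hat, cms, labels)
```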
3.4 Justification for the Trace Norm
Here we provide a further justification for using the trace regularisation. Tanno et al. (2019) showed that if the average CM of the annotators is diagonally dominant, and the cross-entropy term in the loss function is zero, minimising the trace of the estimated CMs uniquely recovers the true CMs. However, their results concern properties of the average CMs of both the annotators and the classifier over the data population, rather than individual data samples. We show a similar but slightly weaker result in the sample-specific regime, which is more relevant here as we estimate the CMs of the respective annotators on every input image.
First, let us set up the notation. For brevity, for a given input image $\mathbf{x}$ and pixel $(w,h)$, we denote the true and estimated CMs of annotator $r$ at that pixel by $A^{(r)}$ and $\hat{A}^{(r)}$, dropping the dependence on $\mathbf{x}, w, h$. We also define the mean CM $A := \sum_r \pi_r A^{(r)}$ and its estimate $\hat{A} := \sum_r \pi_r \hat{A}^{(r)}$, where $\pi_r$ is the probability that annotator $r$ labels image $\mathbf{x}$. Lastly, as stated earlier, we assume there is a single GT segmentation label per image; thus the true $L$-dimensional probability vector at the given pixel, $\mathbf{p}$, takes the form of a one-hot vector, i.e., $p_k = 1$ for, say, class $k$. Then the following result motivates the use of the trace regularisation:
If the annotators' segmentation probabilities are perfectly modelled for the given image, i.e., $\hat{A}^{(r)} \hat{\mathbf{p}} = A^{(r)} \mathbf{p}$ for all $r$, and the average true confusion matrix $A$ at the given pixel and its estimate $\hat{A}$ satisfy $a_{kk} > a_{kj}$ and $\hat{a}_{kk} > \hat{a}_{kj}$ for all $j \neq k$, then minimising the trace of $\hat{A}$ yields $\hat{\mathbf{p}} = \mathbf{p}$ and estimates $\hat{A}^{(r)}$ that agree with the true $A^{(r)}$ in the column of the correct pixel class $k$, and such solutions are unique in that column.
The corresponding proof is provided in the supplementary material. The above result shows that if each estimated annotator's distribution is very close to the true noisy distribution (which is encouraged by minimising the cross-entropy loss), and, for a given pixel, the average CM has its $k^{\text{th}}$ diagonal entry larger than any other entry in the same row¹, then minimising its trace will drive the estimates of the $k^{\text{th}}$ ("correct class") columns in the respective annotators' CMs to match the true values. Although this result is weaker than what was shown in Tanno et al. (2019) for the population setting rather than individual samples, the single-ground-truth assumption means that the remaining columns of the CMs have no effect on the observed annotator distributions, and thus it suffices to recover the column of the correct class.

¹ For the standard "majority vote" label to capture the correct true labels, one requires the $k^{\text{th}}$ diagonal element in the average CM to be larger than the sum of the remaining elements in the same row, which is a stricter condition.
To also encourage the estimated CMs to be diagonally dominant, we initialise them close to identity matrices by training the annotator network to maximise the trace for a sufficient number of iterations as a warm-up period. Intuitively, the combination of the trace term and the cross-entropy separates the true distribution from the annotation noise by finding the maximal amount of confusion that still explains the noisy observations well.
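One simple way to realise a near-identity start for the CMs (a sketch; the warm-up by trace maximisation described above would achieve a similar effect during training) is to bias the diagonal logits before the column-wise softmax:

```python
import torch

L = 3

# Logits for one pixel's CM; adding a large constant to the diagonal before
# the column-wise softmax makes the initial CM close to the identity.
logits = torch.zeros(L, L)
logits += 6.0 * torch.eye(L)       # favour the diagonal (value is illustrative)

A0 = torch.softmax(logits, dim=0)  # columns are probability distributions
```

Starting from a near-identity CM means each estimated annotator initially "trusts" the segmentation network's output, and confusion mass is introduced only where the noisy labels demand it.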
4 Experiments

We evaluate our method on a variety of datasets, including both synthetic and real-world scenarios: 1) for MNIST segmentation and the ISBI 2015 MS lesion segmentation challenge dataset Jesson and Arbel (2015), we apply morphological operations to generate synthetic noisy labels in binary segmentation tasks; 2) for the BraTS 2019 dataset Menze et al. (2014), we apply a similar simulation to create noisy labels in a multi-class segmentation task; 3) we also consider the LIDC-IDRI dataset, which contains multiple annotations per input acquired from different clinical experts, as a practical evaluation. The details of the noisy label simulation can be found in Appendix A.1.
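The morphological noisy-label simulation might be sketched as follows for the binary case (illustrative annotator types only; the exact procedure is in Appendix A.1):

```python
import numpy as np
from scipy import ndimage

def simulate_annotators(gt_mask):
    """Create synthetic noisy binary labels from a GT mask, mimicking
    stereotyped annotator behaviours via morphological operations.
    The specific operations and iteration counts here are illustrative.
    """
    over = ndimage.binary_dilation(gt_mask, iterations=2)   # over-segmenter
    under = ndimage.binary_erosion(gt_mask, iterations=2)   # under-segmenter
    wrong = ndimage.shift(gt_mask.astype(float), shift=(3, 3)) > 0.5  # offset
    good = gt_mask.copy()                                   # faithful annotator
    return {"over": over, "under": under, "wrong": wrong, "good": good}

# Toy GT: a 10x10 square in a 32x32 image.
gt = np.zeros((32, 32), dtype=bool)
gt[10:20, 10:20] = True
noisy = simulate_annotators(gt)
```

Each simulated annotator thus has a consistent, input-dependent error pattern, which is exactly the structure the annotator network is meant to recover.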
Our experiments are based on the assumption that no ground-truth (GT) label is known a priori; hence, we compare our method against multiple label fusion methods. In particular, we consider four label fusion baselines: a) taking the mean of all the noisy labels; b) taking the mode labels via the "majority vote"; c) label fusion via the original STAPLE method Warfield et al. (2004); d) Spatial STAPLE, a more recent extension of c) that accounts for spatial variations in CMs. After curating the noisy annotations via the above methods, we train the segmentation network on the curated labels and report the results. For c) and d), we used the publicly available toolkit (https://www.nitrc.org/projects/masi-fusion/). In addition, we also include a recent method, the Probabilistic U-net, as another baseline, which has been shown to capture inter-reader variations accurately. The details are presented in Appendix A.2.
For evaluation metrics, we use: 1) root-MSE between the estimated CMs and the real CMs; 2) the Dice coefficient (DICE) between the estimated segmentation and the true segmentation; 3) the generalized energy distance proposed in Kohl et al. (2018) to measure the quality of the estimated annotators' labels.
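The generalized energy distance of Kohl et al. (2018) compares the distribution of model samples $S$ against the distribution of annotator masks $Y$ via $D^2 = 2\,\mathbb{E}[d(S,Y)] - \mathbb{E}[d(S,S')] - \mathbb{E}[d(Y,Y')]$ with $d = 1 - \mathrm{IoU}$. A sketch of its computation from finite sample sets:

```python
import numpy as np

def iou_distance(a, b):
    """d(a, b) = 1 - IoU between two binary masks (0 if both are empty)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 0.0 if union == 0 else 1.0 - inter / union

def generalized_energy_distance(samples, annotations):
    """Squared GED between model samples S and annotator masks Y:
    D^2 = 2 E[d(S,Y)] - E[d(S,S')] - E[d(Y,Y')], with d = 1 - IoU."""
    cross = np.mean([iou_distance(s, y) for s in samples for y in annotations])
    within_s = np.mean([iou_distance(s, t) for s in samples for t in samples])
    within_y = np.mean([iou_distance(y, z) for y in annotations for z in annotations])
    return 2 * cross - within_s - within_y

# Sanity check: identical sample and annotation sets give a distance of zero.
a = np.zeros((8, 8), dtype=bool)
a[:4] = True
b = np.zeros((8, 8), dtype=bool)
b[:, :4] = True
ged = generalized_energy_distance([a, b], [a, b])
```

Lower values indicate that the model's output distribution better matches the diversity of the annotators' labels.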
4.1 MNIST and MS lesion segmentation datasets
The MNIST dataset consists of 60,000 training and 10,000 testing examples, all of which are 28 × 28 grayscale images of digits from 0 to 9, and we derive the segmentation labels by thresholding the intensity values at 0.5. The MS dataset is publicly available and comprises 21 3D scans from 5 subjects, split into 10 scans for training and 11 for testing. We hold out 20% of the training images as a validation set for both datasets. On both datasets, our proposed model achieves a higher Dice similarity coefficient than STAPLE in the dense label case and, even more prominently, in the single label (i.e., 1 label per image) case (shown in Tables 1 & 2 and Fig. 2). In addition, our model, both with and without the trace norm, outperforms STAPLE in terms of CM estimation. We also report the performance for different values of the regularisation coefficient in Fig. 2(a). Fig. 2(b) compares the segmentation accuracy on MNIST and MS lesion for a range of average Dice scores, where labels are generated by a group of 5 simulated annotators. Fig. 3 illustrates that our model can capture the patterns of mistakes of each annotator.
| Models | DICE (%) | CM estimation | DICE (%) | CM estimation |
|---|---|---|---|---|
| Naive CNN on mean labels | 38.36 ± 0.41 | n/a | 46.55 ± 0.53 | n/a |
| Naive CNN on mode labels | 62.89 ± 0.63 | n/a | 47.82 ± 0.76 | n/a |
| Probabilistic U-net Kohl et al. (2018) | 65.12 ± 0.83 | n/a | 46.15 ± 0.59 | n/a |
| Separate CNNs on annotators | 70.44 ± 0.65 | n/a | 46.84 ± 1.24 | n/a |
| STAPLE Warfield et al. (2004) | 78.03 ± 0.29 | 0.1241 ± 0.0011 | 55.05 ± 0.53 | 0.1502 ± 0.0026 |
| Spatial STAPLE Asman and Landman (2012) | 78.96 ± 0.22 | 0.1195 ± 0.0013 | 58.37 ± 0.47 | 0.1483 ± 0.0031 |
| Ours without Trace | 79.63 ± 0.53 | 0.1125 ± 0.0037 | 65.77 ± 0.62 | 0.1342 ± 0.0053 |
| Ours | 82.92 ± 0.19 | 0.0893 ± 0.0009 | 67.55 ± 0.31 | 0.0811 ± 0.0024 |
| Oracle (Ours but with known CMs) | 83.29 ± 0.11 | 0.0238 ± 0.0005 | 78.86 ± 0.14 | 0.0415 ± 0.0017 |
| Models | DICE (%) | CM estimation | DICE (%) | CM estimation |
|---|---|---|---|---|
| Naive CNN | 32.79 ± 1.13 | n/a | 27.41 ± 1.45 | n/a |
| STAPLE Warfield et al. (2004) | 54.07 ± 0.68 | 0.2617 ± 0.0064 | 35.74 ± 0.84 | 0.2833 ± 0.0081 |
| Spatial STAPLE Asman and Landman (2012) | 56.73 ± 0.53 | 0.2384 ± 0.0061 | 38.21 ± 0.71 | 0.2591 ± 0.0074 |
| Ours without Trace | 74.48 ± 0.37 | 0.1538 ± 0.0029 | 54.76 ± 0.66 | 0.1745 ± 0.0044 |
| Ours | 76.48 ± 0.25 | 0.1329 ± 0.0012 | 56.43 ± 0.47 | 0.1542 ± 0.0023 |
| Probabilistic U-net Kohl et al. (2018) | 1.46 ± 0.04 | 1.91 ± 0.03 | 3.23 ± 0.07 | 1.97 ± 0.03 |
| Ours | 1.24 ± 0.02 | 1.67 ± 0.03 | 3.14 ± 0.05 | 1.87 ± 0.04 |
4.2 BraTS Dataset and LIDC-IDRI Dataset
We also evaluate our model on a multi-class segmentation task, using all 259 high-grade glioma (HGG) cases in the training data of the 2019 multi-modal Brain Tumour Segmentation Challenge (BraTS). We extract each slice as a 2D image and split them case-wise into 1600 images for training, 300 for validation and 500 for testing. Pre-processing includes: concatenation of all available modalities; centre cropping to 192 × 192; and normalisation for each case at each modality. To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations on the provided GT mask to create 4 synthetic noisy labels with different patterns, namely over-segmentation, under-segmentation, wrong segmentation and good segmentation. The details of the noisy label simulation are in Appendix A.3.
Lastly, we use the LIDC-IDRI dataset to evaluate our method in the scenario where multiple labels are acquired from different clinical experts. The dataset contains 1018 lung CT scans from 1010 patients with manual lesion segmentations from four experts. For each scan, 4 radiologists provided annotation masks for lesions that they independently detected and considered to be abnormal. For our experiments, we use the same method as in Kohl et al. (2018) to pre-process all scans. We split the dataset case-wise into training (722 patients), validation (144 patients) and testing (144 patients) sets. We then resampled the CT scans to a common in-plane resolution and centre cropped 2D images around the lesion positions, in order to focus on the annotated lesions. The lesion positions are those where at least one of the experts segmented a lesion. This results in 5000 images in the training set, 1000 images in the validation set and 1000 images in the test set. Since the dataset does not provide a single curated ground truth for each image, we created a "gold standard" by aggregating the labels via Spatial STAPLE Asman and Landman (2012), a recent variant of the STAPLE framework employed in the creation of public medical image segmentation datasets, e.g., the ISLES Winzeck et al. (2018), MSSeg Commowick et al. (2018) and Gleason'19 datasets. We further note that, as before, we assume labels are only available to the model during training, but not at test time; thus label aggregation methods cannot be applied to the test examples.
On both the BraTS and LIDC-IDRI datasets, our proposed model achieves a higher Dice similarity coefficient than STAPLE and Spatial STAPLE in both the dense label and single label scenarios (shown in Tables 4 and 5 in Appendix A.3). In addition, our model (with trace) outperforms STAPLE in terms of CM estimation by a large margin on BraTS. In Fig. 3(a), we visualise the segmentation results on BraTS and the corresponding annotators' predictions. Fig. 3(b) presents three examples of the segmentation results and the corresponding four annotator contours, as well as the consensus. As shown in both figures, our model successfully predicts both the segmentation of lesions and the variations of each annotator in different cases. We also measure the inter-reader consensus levels by computing the IoU of the multiple annotations, and compare the segmentation performance in three subgroups of different consensus levels (low, medium and high). Results are shown in Fig. 10 and Fig. 11 in Appendix A.3.
Additionally, as shown in Table 3, our model consistently outperforms the Probabilistic U-net on the generalized energy distance across the four different test datasets, indicating that our method can better capture inter-annotator variations than the baseline Probabilistic U-net. This result shows that the information about which labels are acquired from whom is useful in modelling the variability in the observed segmentation labels.
We introduced the first CNN-based learning method for simultaneously recovering the label noise of multiple annotators and the GT label distribution in supervised segmentation problems. We demonstrated this method on real-world datasets with both synthetic and real-world annotations. Our method is capable of estimating the characteristics of individual annotators and thereby improving robustness against label noise. Experiments have shown that our model achieves considerable improvement over traditional label fusion approaches, including averaging, the majority vote and the widely used STAPLE framework and its spatially varying versions, in terms of both segmentation accuracy and the quality of CM estimation.
In the future, we plan to accommodate meta-information about annotators (e.g., number of years of experience) and non-image data (e.g., genetics) that may influence the pattern of the underlying segmentation label, such as lesion appearance, in our framework. We are also interested in assessing the utility of our approach in downstream applications. Of particular interest is the design of active data collection schemes where the segmentation model is used to select which samples to annotate ("active learning"), and the annotator models are used to decide which experts should label them ("active labelling") Yan et al. (2010). Another exciting avenue of application is the education of inexperienced annotators; the estimated spatial characteristics of segmentation mistakes provide further insights into their annotation behaviours, which may help them improve their annotation quality.
We would like to thank Swami Sankaranarayanan and Ardavan Saeedi at Butterfly Network for their feedback and initial discussions. Mou-Cheng is supported by GSK funding (BIDS3000034123) via UCL EPSRC CDT in i4health and UCL Engineering Dean’s Prize. We are also grateful for EPSRC grants EP/R006032/1, EP/M020533/1, CRUK/EPSRC grant NS/A000069/1, and the NIHR UCLH Biomedical Research Centre, which support this work.
- Lazarus et al. Elizabeth Lazarus, Martha B Mainiero, Barbara Schepps, Susan L Koelliker, and Linda S Livingston. BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology, 239(2):385–391, 2006.
- Watadani et al.  Takeyuki Watadani, Fumikazu Sakai, Takeshi Johkoh, Satoshi Noma, Masanori Akira, Kiminori Fujimoto, Alexander A Bankier, Kyung Soo Lee, Nestor L Müller, Jae-Woo Song, et al. Interobserver variability in the ct assessment of honeycombing in the lungs. Radiology, 266(3):936–944, 2013.
- Rosenkrantz et al.  Andrew B Rosenkrantz, Ruth P Lim, Mershad Haghighi, Molly B Somberg, James S Babb, and Samir S Taneja. Comparison of interreader reproducibility of the prostate imaging reporting and data system and likert scales for evaluation of multiparametric prostate mri. American Journal of Roentgenology, 201(4):W612–W618, 2013.
- Menze et al.  Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging, 34(10):1993–2024, 2014.
- Joskowicz et al.  Leo Joskowicz, D Cohen, N Caplan, and J Sosna. Inter-observer variability of manual contour delineation of structures in ct. European radiology, 29(3):1391–1399, 2019.
- Zhang et al.  Huahong Zhang, Alessandra M Valcarcel, Rohit Bakshi, Renxin Chu, Francesca Bagnato, Russell T Shinohara, Kilian Hett, and Ipek Oguz. Multiple sclerosis lesion segmentation with tiramisu and 2.5 d stacked slices. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 338–346. Springer, 2019.
- Kats et al.  Eytan Kats, Jacob Goldberger, and Hayit Greenspan. A soft staple algorithm combined with anatomical knowledge. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 510–517. Springer, 2019.
- Harvey and Glocker  Hugh Harvey and Ben Glocker. A standardised approach for preparing imaging data for machine learning tasks in radiology. In Artificial Intelligence in Medical Imaging, pages 61–72. Springer, 2019.
- Warfield et al.  Simon K Warfield, Kelly H Zou, and William M Wells. Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE transactions on medical imaging, 23(7):903–921, 2004.
- Winzeck et al.  Stefan Winzeck, Arsany Hakim, Richard McKinley, Jos AADSR Pinto, Victor Alves, Carlos Silva, Maxim Pisov, Egor Krivov, Mikhail Belyaev, Miguel Monteiro, et al. Isles 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral mri. Frontiers in neurology, 9:679, 2018.
- Commowick et al.  Olivier Commowick, Audrey Istace, Michael Kain, Baptiste Laurent, Florent Leray, Mathieu Simon, Sorina Camarasu Pop, Pascal Girard, Roxana Ameli, Jean-Christophe Ferré, et al. Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Scientific reports, 8(1):1–17, 2018.
-  Gleason 2019 challenge. https://gleason2019.grand-challenge.org/Home/. Accessed: 2020-02-30.
- Asman and Landman  Andrew J Asman and Bennett A Landman. Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (collate). IEEE transactions on medical imaging, 30(10):1779–1794, 2011.
- Asman and Landman  Andrew J Asman and Bennett A Landman. Formulating spatially varying performance in the statistical fusion framework. IEEE transactions on medical imaging, 31(6):1326–1336, 2012.
- Iglesias et al.  Juan Eugenio Iglesias, Mert Rory Sabuncu, and Koen Van Leemput. A unified framework for cross-modality multi-atlas segmentation of brain mri. Medical image analysis, 17(8):1181–1191, 2013.
- Cardoso et al.  M Jorge Cardoso, Kelvin Leung, Marc Modat, Shiva Keihaninejad, David Cash, Josephine Barnes, Nick C Fox, Sebastien Ourselin, Alzheimer’s Disease Neuroimaging Initiative, et al. Steps: Similarity and truth estimation for propagated segmentations and its application to hippocampal segmentation and brain parcelation. Medical image analysis, 17(6):671–684, 2013.
- Asman and Landman  Andrew J Asman and Bennett A Landman. Non-local statistical label fusion for multi-atlas segmentation. Medical image analysis, 17(2):194–208, 2013.
- Akhondi-Asl et al.  Alireza Akhondi-Asl, Lennox Hoyte, Mark E Lockhart, and Simon K Warfield. A logarithmic opinion pool based staple algorithm for the fusion of segmentations with associated reliability weights. IEEE transactions on medical imaging, 33(10):1997–2009, 2014.
- Castro et al.  Daniel C. Castro, Jeremy Tan, Bernhard Kainz, Ender Konukoglu, and Ben Glocker. Morpho-MNIST: Quantitative assessment and diagnostics for representation learning. Journal of Machine Learning Research, 20, 2019.
- Styner et al.  Martin Styner, Joohwi Lee, Brian Chin, M Chin, Olivier Commowick, H Tran, S Markovic-Plese, V Jewells, and S Warfield. 3d segmentation in the clinic: A grand challenge ii: Ms lesion segmentation. Midas Journal, 2008:1–6, 2008.
- Armato III et al.  Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38(2):915–931, 2011.
- Weisenfeld and Warfield  Neil I Weisenfeld and Simon K Warfield. Learning likelihoods for labeling (l3): a general multi-classifier segmentation algorithm. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 322–329. Springer, 2011.
- Joskowicz et al.  Leo Joskowicz, D Cohen, N Caplan, and Jacob Sosna. Automatic segmentation variability estimation with segmentation priors. Medical image analysis, 50:54–64, 2018.
- Kohl et al.  Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R Ledsam, Klaus Maier-Hein, SM Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A probabilistic u-net for segmentation of ambiguous images. In Advances in Neural Information Processing Systems, pages 6965–6975, 2018.
- Baumgartner et al.  Christian F Baumgartner, Kerem C Tezcan, Krishna Chaitanya, Andreas M Hötker, Urs J Muehlematter, Khoschy Schawkat, Anton S Becker, Olivio Donati, and Ender Konukoglu. Phiseg: Capturing uncertainty in medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 119–127. Springer, 2019.
- Raykar et al.  Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
- Khetan et al.  Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
- Tanno et al.  Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. arXiv preprint arXiv:1902.03680, 2019.
- Dawid and Skene  Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pages 20–28, 1979.
- Smyth et al.  Padhraic Smyth, Usama M Fayyad, Michael C Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in neural information processing systems, pages 1085–1092, 1995.
- Whitehill et al.  Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
- Welinder et al.  Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.
- Rodrigues et al.  Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Learning from multiple annotators: distinguishing good from random labelers. Pattern Recognition Letters, 34(12):1428–1436, 2013.
- Raykar et al.  Vikas C Raykar, Shipeng Yu, Linda H Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th Annual international conference on machine learning, pages 889–896. ACM, 2009.
- Yan et al.  Yan Yan, Rómer Rosales, Glenn Fung, Mark Schmidt, Gerardo Hermosillo, Luca Bogoni, Linda Moy, and Jennifer Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In AISTATs, pages 932–939, 2010.
- Branson et al.  Steve Branson, Grant Van Horn, and Pietro Perona. Lean crowdsourcing: Combining humans and machines in an online system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7474–7483, 2017.
- Van Horn et al.  Grant Van Horn, Steve Branson, Scott Loarie, Serge Belongie, Cornell Tech, and Pietro Perona. Lean multiclass crowdsourcing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2714–2723, 2018.
- Jesson and Arbel  Andrew Jesson and Tal Arbel. Hierarchical MRF and random forest segmentation of MS lesions and healthy tissues in brain MRI. In Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, pages 1–2, 2015.
- Chandra et al.  Siddhartha Chandra, Nicolas Usunier, and Iasonas Kokkinos. Dense and low-rank gaussian crfs using deep embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 5103–5112, 2017.
- Fey and Lenssen  Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Sukhbaatar et al.  Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
Appendix A Additional results
a.1 Annotation Simulation Details
We generate synthetic annotations from an assumed GT on the MNIST, MS lesion and BraTS datasets, to evaluate the efficacy of the approach in an idealised situation where the GT is known. We simulate a group of 5 annotators with disparate characteristics by performing morphological transformations (e.g., thinning, thickening, fractures) on the ground-truth (GT) segmentation labels, using the Morpho-MNIST software Castro et al. . In particular, the first annotator provides a faithful segmentation close to the GT (“good-segmentation”), the second tends to over-segment (“over-segmentation”), the third tends to under-segment (“under-segmentation”), the fourth is prone to a combination of small fractures and over-segmentation (“wrong-segmentation”), and the fifth always annotates everything as the background (“blank-segmentation”). To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations on the provided GT mask to create 4 synthetic noisy labels with different patterns, namely over-segmentation, under-segmentation, wrong segmentation and good segmentation. We create training data by deriving labels from the simulated annotators. We also experimented with varying the severity of the morphological operations on the MNIST and MS lesion datasets, to test the robustness of our methods to varying degrees of annotation noise.
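The five annotator styles above can be approximated with standard morphological operations. The following is an illustrative sketch using `scipy.ndimage` rather than the actual Morpho-MNIST pipeline; the function name and the fracture parameters are ours:

```python
import numpy as np
from scipy import ndimage

def simulate_annotators(gt_mask, seed=0):
    """Simulate five annotator styles from a binary GT mask.

    A simplified stand-in for the Morpho-MNIST operations described in
    the text: thickening/thinning via dilation/erosion, and 'fractures'
    as small erased patches. Iteration counts and patch sizes are
    illustrative choices, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    good = gt_mask.copy()                                   # faithful annotator
    over = ndimage.binary_dilation(gt_mask, iterations=2)   # over-segmentation
    under = ndimage.binary_erosion(gt_mask, iterations=2)   # under-segmentation
    wrong = ndimage.binary_dilation(gt_mask, iterations=2)  # over-segmentation ...
    for _ in range(3):                                      # ... plus small random fractures
        r = rng.integers(0, gt_mask.shape[0] - 3)
        c = rng.integers(0, gt_mask.shape[1] - 3)
        wrong[r:r + 3, c:c + 3] = False
    blank = np.zeros_like(gt_mask)                          # labels everything as background
    return {"good": good, "over": over, "under": under, "wrong": wrong, "blank": blank}
```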
a.2 Additional Qualitative Results on MNIST and MS Dataset
Here we provide additional qualitative comparisons of segmentation results and CM visualisations on the MNIST and MS datasets. We examine the ability of our method to learn the annotators' CMs and the true label distribution when only a single label per image is available. Fig. 3 and Fig. 5 show the segmentation results on the MNIST dataset in this single-label setting. Our model achieved a higher dice similarity coefficient than STAPLE and Spatial STAPLE and, more prominently, outperformed both of them in terms of CM estimation, whether or not the trace norm was used. Fig. 4 and Fig. 6 illustrate that, even with a single label per image, our model can still capture the annotators' patterns of mistakes.
a.3 Quantitative and Extra Qualitative Results on BraTS and LIDC-IDRI
Here we provide the quantitative comparison of our method and other baselines on the BraTS and LIDC-IDRI datasets, which was precluded from the main text due to the space limit (see Table 4 and Table 5). We also provide additional qualitative examples on both datasets (see Figs. 7, 8 and 9). Lastly, we compare the segmentation performance on 3 subgroups of LIDC-IDRI with varying levels of inter-reader variability; Fig. 11 illustrates that our method attains consistent improvement over the baselines in all cases, indicating its ability to segment robustly even the hard examples on which the experts themselves disagreed to a large extent.
BraTS 2019 is a multi-class segmentation dataset, containing 259 cases with high-grade (HG) and 76 cases with low-grade (LG) glioma (a type of brain tumour). For each case, four MRI modalities are available: FLAIR, T1, T1-contrast and T2. The datasets are pre-processed by the organisers: co-registered to the same anatomical template, interpolated to the same resolution (1 mm isotropic) and skull-stripped. We centre-cropped 2D images (192 × 192 pixels) and hold out 1600 2D images for training, 300 for validation and 500 for testing. We apply Gaussian normalisation to each case of each modality, to obtain zero mean and unit variance. Fig. 7 shows another tumour case in the four different modalities with different target labels. We also present several example results of the different methods in Fig. 8.
To demonstrate the performance on a dataset with real-world annotations, we have also evaluated our model on LIDC-IDRI. The “ground truth” labels in these experiments are generated by aggregating the multiple labels via Spatial STAPLE Asman and Landman , as used in the curation of existing public datasets, e.g., ISLES Winzeck et al. , MSSeg Commowick et al. and Gleason’19 gle . Fig. 9 presents several examples of segmentation results from the different methods. We also measure the inter-reader consensus level by computing the IoU of the annotations, and compare in Fig. 10 the estimates from our model against the values measured on the real annotations. Furthermore, we divide the test dataset, whose consensus values range from 30% to 90%, into low-consensus (30% to 65%), middle-consensus (65% to 75%) and high-consensus (75% to 90%) subgroups, and compare the dice coefficients of our model and the baselines in Fig. 11. Our method shows a competitive ability to segment the challenging examples with low consensus values.
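The consensus measure and subgrouping described above amount to a mean pairwise IoU over the annotations followed by thresholding. A small illustrative sketch (function names are ours; the subgroup boundaries follow the text):

```python
import numpy as np
from itertools import combinations

def consensus_iou(annotations):
    """Mean pairwise IoU across a list of binary annotation masks,
    used as the inter-reader consensus level of a case."""
    ious = []
    for a, b in combinations(annotations, 2):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        ious.append(1.0 if union == 0 else inter / union)  # empty masks agree
    return float(np.mean(ious))

def consensus_group(iou):
    # Subgroup boundaries as used in the comparison above.
    if iou < 0.65:
        return "low"
    if iou < 0.75:
        return "middle"
    return "high"
```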
On both the BraTS and LIDC-IDRI datasets, our proposed model consistently achieves a higher dice similarity coefficient than STAPLE in both the dense-labels and single-label scenarios (shown in Table 4 and Table 5). In addition, our model (with trace) outperforms STAPLE in terms of CM estimation by a large margin on BraTS. In Fig. 7, we visualise the segmentation results on BraTS and the corresponding annotators' predictions. Fig. 8 presents four examples of the segmentation results and the corresponding annotators' predictions, alongside the baseline methods. As shown in both figures, our model successfully predicts both the segmentation of the lesions and the variations of each annotator in the different cases.
| Models | DICE (%) | CM estimation | DICE (%) | CM estimation |
|---|---|---|---|---|
| Naive CNN on mean labels | 29.42 ± 0.58 | n/a | 56.72 ± 0.61 | n/a |
| Naive CNN on mode labels | 34.12 ± 0.45 | n/a | 58.64 ± 0.47 | n/a |
| Probabilistic U-net Kohl et al. | 40.53 ± 0.75 | n/a | 61.26 ± 0.69 | n/a |
| STAPLE Warfield et al. | 46.73 ± 0.17 | 0.2147 ± 0.0103 | 69.34 ± 0.58 | 0.0832 ± 0.0043 |
| Spatial STAPLE Asman and Landman | 47.31 ± 0.21 | 0.1871 ± 0.0094 | 70.92 ± 0.18 | 0.0746 ± 0.0057 |
| Ours without Trace | 49.03 ± 0.34 | 0.1569 ± 0.0072 | 71.25 ± 0.12 | 0.0482 ± 0.0038 |
| Ours | 53.47 ± 0.24 | 0.1185 ± 0.0056 | 74.12 ± 0.19 | 0.0451 ± 0.0025 |
| Oracle (Ours but with known CMs) | 67.13 ± 0.14 | 0.0843 ± 0.0029 | 79.41 ± 0.17 | 0.0381 ± 0.0021 |
| Models | DICE (%) | CM estimation | DICE (%) | CM estimation |
|---|---|---|---|---|
| Naive CNN on mean & mode labels | 36.12 ± 0.93 | n/a | 48.36 ± 0.79 | n/a |
| STAPLE Warfield et al. | 38.74 ± 0.85 | 0.2956 ± 0.1047 | 57.32 ± 0.87 | 0.1715 ± 0.0134 |
| Spatial STAPLE Asman and Landman | 41.59 ± 0.74 | 0.2543 ± 0.0867 | 62.35 ± 0.64 | 0.1419 ± 0.0207 |
| Ours without Trace | 43.74 ± 0.49 | 0.1825 ± 0.0724 | 66.95 ± 0.51 | 0.0921 ± 0.0167 |
| Ours | 46.21 ± 0.28 | 0.1576 ± 0.0487 | 68.12 ± 0.48 | 0.0587 ± 0.0098 |
a.4 Low-rank Approximation
Here we show our preliminary results on the low-rank approximation of the confusion matrices for the BraTS dataset, precluded from the main text. Table 6 compares the performance of our method with the default implementation and with the rank-1 approximation. We see that the low-rank approximation halves the number of parameters in the CMs and substantially reduces the number of floating-point operations (FLOPs) in computing the annotator prediction, while reasonably retaining the performance on both segmentation and CM estimation. We note, however, that the practical gain of this approximation in this task is limited, since the number of classes is only 4, as indicated by the marginal reduction in the overall GPU usage for one example. We expect the gain to increase when the number of classes is larger, as shown in Fig. 12.
| Rank | Dice | CM estimation | GPU Memory | No. Parameters | FLOPs |
|---|---|---|---|---|---|
| Default | 53.47 ± 0.24 | 0.1185 ± 0.0056 | 2.68 GB | 589824 | 1032192 |
| rank 1 | 50.56 ± 2.00 | 0.1925 ± 0.0314 | 2.57 GB | 294912 | 405504 |
Lastly, we also describe the details of the devised low-rank approximation. Analogous to the work of Chandra et al. Chandra et al. , where a similar approximation is employed to estimate the pairwise terms in a densely connected CRF, we parametrise each spatial CM $A \in \mathbb{R}^{L \times L}$ as a product $A = BC^{\top}$ of two smaller rectangular matrices $B$ and $C$ of size $L \times r$, where $r < L$. In this case, the annotator network outputs $B$ and $C$ for each annotator in lieu of the full CM. Two separate rectangular matrices are used here since the confusion matrices are not necessarily symmetric. Such a low-rank approximation reduces the total number of variables per pixel to $2Lr$ from $L^2$, and the number of floating-point operations (FLOPs) in computing the annotator prediction to $O(Lr)$ from $O(L^2)$. Fig. 12 shows that the time and space complexity of the default method grow quadratically in the number of classes, while the low-rank approximations grow linearly.
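Assuming the per-pixel matrix–vector product is counted as multiplications plus additions, the parameter and FLOP counts of the two variants can be reproduced exactly. The helper names below are ours; the asserts match Table 6 (BraTS: $L = 4$ classes, $192 \times 192$ pixels):

```python
def cm_params(L, H, W, r=None):
    """Confusion-matrix parameters per image.

    Full CM: one L x L matrix per pixel; rank-r: two L x r factors per pixel.
    """
    per_pixel = L * L if r is None else 2 * L * r
    return H * W * per_pixel

def cm_flops(L, H, W, r=None):
    """FLOPs to compute the annotator prediction A @ p at every pixel."""
    # Full matrix-vector product: L*L multiplications + L*(L-1) additions.
    # Rank-r: y = C^T p costs r*(2L-1) FLOPs, then B @ y costs L*(2r-1).
    per_pixel = 2 * L * L - L if r is None else r * (2 * L - 1) + L * (2 * r - 1)
    return H * W * per_pixel

# Reproduces the counts in Table 6:
assert cm_params(4, 192, 192) == 589824
assert cm_params(4, 192, 192, r=1) == 294912
assert cm_flops(4, 192, 192) == 1032192
assert cm_flops(4, 192, 192, r=1) == 405504
```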
Appendix B Implementation details
Our method is implemented in PyTorch 1.0 Fey and Lenssen . Our network is based on a 2D U-net Ronneberger et al. with 4 down-sampling stages; the channel numbers of the encoders are 32, 64, 128 and 256, and we replaced the batch normalisation layers with instance normalisation. Our segmentation network and annotator network share all parameters apart from the last layer of the U-net decoder; essentially, the overall architecture is implemented as a U-net with multiple output heads: one predicting the true segmentation, and the others predicting the noisy segmentations of the respective annotators. For the segmentation network, the output of the last layer has $c$ channels, where $c$ is the number of classes. For the annotator network, by default, the output of the last layer has $c^2$ channels, for estimating a confusion matrix at each spatial location; when the rank-1 approximation is used, it has $2c$ channels. The Probabilistic U-net implementation is adopted from https://github.com/stefanknegt/Probabilistic-Unet-Pytorch; for a fair comparison, we adjusted the number of channels and the depth of its U-net backbone to match our networks. All of the models were trained on an NVIDIA RTX 2080 at least 3 times with different random initialisations to compute the mean performance and its standard deviation. The Adam Kingma and Ba  optimiser was used in all experiments with the default hyper-parameter settings. We provide all of the hyper-parameters of the experiments for each dataset in Table 7, and kept the training details the same between the baselines and our method.
| Data set | Learning Rate | Epochs | Batch Size | Augmentation | Weight for regularisation |
|---|---|---|---|---|---|
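The multi-head output layer described above can be sketched as follows. This is an illustrative PyTorch fragment, not the released code: the class name and softplus positivity constraint are ours, while the channel counts ($c$ for segmentation, $c^2$ for a full per-pixel CM, $2rc$ for a rank-$r$ approximation) follow the description in this section:

```python
import torch
import torch.nn as nn

class MultiHeadOutput(nn.Module):
    """Final layers on top of a shared U-net decoder: one segmentation head
    (num_classes channels) and one CM head per annotator (num_classes**2
    channels by default, or 2 * rank * num_classes with the low-rank variant)."""

    def __init__(self, in_ch, num_classes, num_annotators, rank=None):
        super().__init__()
        cm_ch = num_classes ** 2 if rank is None else 2 * rank * num_classes
        self.seg_head = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        self.cm_heads = nn.ModuleList(
            [nn.Conv2d(in_ch, cm_ch, kernel_size=1) for _ in range(num_annotators)]
        )

    def forward(self, feats):
        seg = self.seg_head(feats)                     # unnormalised class scores
        # Softplus keeps the CM entries positive before column normalisation.
        cms = [torch.nn.functional.softplus(h(feats)) for h in self.cm_heads]
        return seg, cms
```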
b.1 Pytorch implementation of loss function
Here we describe the PyTorch implementation of the loss function in eq. (4). It takes the unnormalised probabilities from the segmentation network (`torch.tensor`), together with the estimated confusion matrices and the annotators' noisy labels. We also intend to clean up the whole codebase and release it in the final version.
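A minimal self-contained sketch of such a trace-regularised multi-annotator loss is given below. The function and argument names (`noisy_label_loss`, `alpha`) are illustrative and not the released implementation; the cross-entropy is taken between each annotator's labels and the product of their estimated CM with the estimated true-label distribution, plus the trace regulariser:

```python
import torch
import torch.nn.functional as F

def noisy_label_loss(pred, cms, labels, alpha=0.1):
    """Trace-regularised loss over multiple annotators.

    pred:   (B, C, H, W) unnormalised probabilities (logits) from the
            segmentation network.
    cms:    per-annotator confusion matrices of shape (B, C*C, H, W),
            as produced by the annotator network.
    labels: per-annotator noisy label maps of shape (B, H, W).
    alpha:  weight of the trace regulariser.
    """
    b, c, h, w = pred.shape
    p = F.softmax(pred, dim=1)                              # estimated true-label distribution
    p = p.view(b, c, h * w).permute(0, 2, 1).unsqueeze(-1)  # (B, HW, C, 1)
    ce, trace = 0.0, 0.0
    for cm, label in zip(cms, labels):
        cm = cm.view(b, c, c, h * w).permute(0, 3, 1, 2)    # (B, HW, C, C)
        cm = cm / cm.sum(dim=2, keepdim=True)               # normalise columns to sum to 1
        noisy_p = torch.matmul(cm, p).squeeze(-1)           # annotator's predicted distribution
        noisy_p = noisy_p.permute(0, 2, 1).view(b, c, h, w)
        ce = ce + F.nll_loss(torch.log(noisy_p + 1e-12), label)
        trace = trace + cm.diagonal(dim1=2, dim2=3).sum(-1).mean()
    return ce + alpha * trace
```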
Appendix C Proof of Theorem 1
We first show a specific case of Theorem 1 when there is only a single annotator, and subsequently extend it to the scenario with multiple annotators. Without loss of generality, we show the result for an arbitrary choice of pixel in a given input image $\mathbf{x}$. Specifically, let us denote the estimated confusion matrix (CM) of the annotator at the pixel by $\hat{A} = (\hat{a}_{ij}) \in \mathbb{R}^{L \times L}$ and the true CM by $A = (a_{ij})$, each with columns summing to one, and suppose the true class of this pixel is $k$, i.e., $\mathbf{p} = \mathbf{e}_k$, where $\mathbf{e}_k$ denotes the $k^{\text{th}}$ elementary basis vector. Let $\hat{\mathbf{p}} \in [0,1]^L$ denote the $L$-dimensional estimated label distribution at the corresponding pixel (instead of over the whole image).

Lemma 1. If the annotator's segmentation probability is fully captured by the model for the pixel in image $\mathbf{x}$, i.e., $\hat{A}\hat{\mathbf{p}} = A\mathbf{p}$, and both $\hat{A}$, $A$ satisfy $a_{ii} \ge \max(a_{ij}, a_{ji})$ (strictly for $\hat{A}$) for $i = 1, \ldots, L$ and for all $j$ such that $j \neq i$, then $\mathrm{tr}(\hat{A})$ is minimised when $\hat{\mathbf{p}} = \mathbf{p}$. Furthermore, if $\mathrm{tr}(\hat{A}) = \mathrm{tr}(A)$, then the true label is fully recovered, i.e., $\hat{\mathbf{p}} = \mathbf{p}$, and the $k^{\text{th}}$ columns of $\hat{A}$, $A$ are the same.

We first show that the diagonal element $a_{kk}$ in $A$ is smaller than or equal to its estimate $\hat{a}_{kk}$ in $\hat{A}$. Since $\mathbf{p}$ is a one-hot vector, $A\mathbf{p} = A\mathbf{e}_k$ holds and $\hat{a}_{kj} \le \hat{a}_{kk}$ for all $j$, so it follows that:

$$a_{kk} = (A\mathbf{e}_k)_k = (\hat{A}\hat{\mathbf{p}})_k = \sum_{j=1}^{L}\hat{a}_{kj}\,\hat{p}_j \;\le\; \hat{a}_{kk}\sum_{j=1}^{L}\hat{p}_j = \hat{a}_{kk}. \tag{5}$$

The possibility of equality in the above comes from the fact that all entries in $\hat{\mathbf{p}}$ except the $k^{\text{th}}$ element could be zero. Now, the assumption that there is a single ground-truth label for the pixel means that all the values of the true CM $A$ are uniformly equal to $\tfrac{1}{L}$ except the $k^{\text{th}}$ column. In addition, since the diagonal dominance of the column-normalised estimated CM means each $\hat{a}_{jj}$ is at least $\tfrac{1}{L}$, we have that

$$\mathrm{tr}(\hat{A}) = \sum_{j=1}^{L}\hat{a}_{jj} \;\ge\; \hat{a}_{kk} + \frac{L-1}{L} \;\ge\; a_{kk} + \frac{L-1}{L} = \mathrm{tr}(A). \tag{6}$$

It therefore follows that when $\hat{a}_{kk} = a_{kk}$ and $\hat{a}_{jj} = \tfrac{1}{L}$ for $j \neq k$ hold, the trace of $\hat{A}$ is the smallest. Now, we show that when this minimum is attained, i.e., $\mathrm{tr}(\hat{A}) = \mathrm{tr}(A)$, the $k^{\text{th}}$ columns of the two matrices match up.

By way of contradiction, let us assume that there exists a class $l \neq k$ for which the estimated label probability is non-zero, i.e., $\hat{p}_l > 0$. This implies that $\hat{p}_k < 1$. From eq. (6), if the traces of $A$ and $\hat{A}$ are the same, then $\hat{a}_{kk} = a_{kk}$ also holds, and thus from eq. (5) we have $\hat{a}_{kk} = \sum_{j=1}^{L}\hat{a}_{kj}\hat{p}_j$. By rearranging this equality and dividing both sides by $1 - \hat{p}_k > 0$, we obtain $\hat{a}_{kk} = \sum_{j \neq k}\hat{a}_{kj}\,\frac{\hat{p}_j}{1-\hat{p}_k}$. Now, as we have $\hat{a}_{kj} < \hat{a}_{kk}$ for all $j \neq k$, it follows that

$$\hat{a}_{kk} = \sum_{j \neq k}\hat{a}_{kj}\,\frac{\hat{p}_j}{1-\hat{p}_k} \;<\; \hat{a}_{kk}\sum_{j \neq k}\frac{\hat{p}_j}{1-\hat{p}_k} = \hat{a}_{kk},$$

which is false. Therefore, the trace equality implies $\hat{\mathbf{p}} = \mathbf{e}_k = \mathbf{p}$, and thus from $\hat{A}\hat{\mathbf{p}} = A\mathbf{p}$ we conclude that the $k^{\text{th}}$ columns of $\hat{A}$ and $A$ are the same.
We note that the equivalent result for the expectation of the annotator's CM over the data population was provided in Sukhbaatar et al.  and Tanno et al. . The main difference, as described in the main text, is that we show a slightly weaker, sample-specific version of their result.
Now, we show that the main theorem follows naturally from the above lemma. As a reminder, we restate the theorem below.
Theorem 1. For the pixel in a given image $\mathbf{x}$, we define the mean confusion matrix (CM) $A(\mathbf{x}) := \sum_{r=1}^{R}\pi_r(\mathbf{x})\,A^{(r)}(\mathbf{x})$ and its estimate $\hat{A}(\mathbf{x}) := \sum_{r=1}^{R}\pi_r(\mathbf{x})\,\hat{A}^{(r)}(\mathbf{x})$, where $\pi_r(\mathbf{x})$ is the probability that the $r^{\text{th}}$ annotator labels image $\mathbf{x}$. If the annotators' segmentation probabilities are perfectly modelled by the model for the given image, i.e., $\hat{A}^{(r)}(\mathbf{x})\hat{\mathbf{p}}(\mathbf{x}) = A^{(r)}(\mathbf{x})\mathbf{p}(\mathbf{x})$ for $r = 1, \ldots, R$, and the average true confusion matrix $A(\mathbf{x})$ at a given pixel and its estimate $\hat{A}(\mathbf{x})$ satisfy $a_{ii}(\mathbf{x}) \ge \max(a_{ij}(\mathbf{x}), a_{ji}(\mathbf{x}))$ (strictly for $\hat{A}(\mathbf{x})$) for $i = 1, \ldots, L$ and for all $j$ such that $j \neq i$, then $\mathrm{tr}(\hat{A}(\mathbf{x}))$ is minimised when $\hat{\mathbf{p}}(\mathbf{x}) = \mathbf{p}(\mathbf{x})$, and such solutions are unique in the $k^{\text{th}}$ columns, where $k$ is the correct pixel class.

A direct application of Lemma 1 shows firstly that $\mathrm{tr}(\hat{A}(\mathbf{x}))$ is minimised when $\hat{A}^{(r)}(\mathbf{x})\hat{\mathbf{p}}(\mathbf{x}) = A^{(r)}(\mathbf{x})\mathbf{p}(\mathbf{x})$ holds for all $r$ (since that ensures $\hat{A}(\mathbf{x})\hat{\mathbf{p}}(\mathbf{x}) = A(\mathbf{x})\mathbf{p}(\mathbf{x})$). Secondly, it implies that minimising $\mathrm{tr}(\hat{A}(\mathbf{x}))$ yields $\hat{\mathbf{p}}(\mathbf{x}) = \mathbf{p}(\mathbf{x})$. Because we assume that the annotators' noisy labels are correctly modelled, i.e., $\hat{A}^{(r)}(\mathbf{x})\hat{\mathbf{p}}(\mathbf{x}) = A^{(r)}(\mathbf{x})\mathbf{p}(\mathbf{x})$, it therefore follows that the $k^{\text{th}}$ columns of $\hat{A}^{(r)}(\mathbf{x})$ and $A^{(r)}(\mathbf{x})$ are the same.
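As a quick numerical sanity check of the trace inequality used in the proof, one can construct a true CM whose $k^{\text{th}}$ column has a dominant diagonal entry and whose remaining columns are uniform, together with a column-diagonally-dominant estimate sharing that column (corresponding to $\hat{\mathbf{p}} = \mathbf{e}_k$), and verify $\mathrm{tr}(\hat{A}) \ge \mathrm{tr}(A)$. The construction below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
L, k = 4, 2

# True CM: the k-th column is the annotator's noising of the correct class;
# the remaining (unobservable) columns are set uniform to 1/L as in the proof.
A = np.full((L, L), 1.0 / L)
col = rng.dirichlet(np.ones(L))
col[k] = col.max() + 0.1        # make the diagonal entry dominate its column ...
A[:, k] = col / col.sum()       # ... then renormalise the column to sum to one

# Estimate consistent with the annotator (k-th columns match); the other
# columns are random but column-diagonally dominant, hence each diagonal
# entry is at least 1/L.
A_hat = np.zeros((L, L))
A_hat[:, k] = A[:, k]
for j in range(L):
    if j == k:
        continue
    c = rng.dirichlet(np.ones(L))
    c[j] = c.max() + 0.1
    A_hat[:, j] = c / c.sum()

assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A_hat.sum(axis=0), 1.0)
assert np.trace(A_hat) >= np.trace(A) - 1e-12   # the trace inequality of the lemma
```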