1 Introduction
Segmentation of anatomical structures in medical images is known to suffer from high inter-reader variability Lazarus et al. (2006); Watadani et al. (2013); Rosenkrantz et al. (2013); Menze et al. (2014); Joskowicz et al. (2019), influencing the performance of downstream supervised machine learning models. This problem is particularly prominent in the medical domain, where labelled data is commonly scarce due to the high cost of annotations. For instance, accurate identification of multiple sclerosis (MS) lesions in MRIs is difficult even for experienced experts due to variability in lesion location, size and shape, and anatomical variability across patients Zhang et al. (2019). As another example, Menze et al. (2014) report average inter-reader variability in the range of 74-85% for glioblastoma (a type of brain tumour) segmentation. Further aggravated by differences in biases and levels of expertise, segmentation annotations of structures in medical images suffer from high annotation variation Kats et al. (2019). In consequence, despite the present abundance of medical imaging data thanks to over two decades of digitisation, the world still remains relatively short of data with curated labels Harvey and Glocker (2019) that are amenable to machine learning, necessitating intelligent methods to learn robustly from such noisy annotations.

To mitigate inter-reader variations, different preprocessing techniques are commonly used to curate segmentation annotations by fusing labels from different experts. The most basic yet popular approach is based on the majority vote, where the most representative opinion of the experts is treated as the ground truth (GT). A smarter version that accounts for the similarity of classes has proven effective in the aggregation of brain tumour segmentation labels Menze et al. (2014). A key limitation of such approaches, however, is that all experts are assumed to be equally reliable. Warfield et al. (2004) proposed a label fusion method, called STAPLE, that explicitly models the reliability of individual experts and uses that information to "weigh" their opinions in the label aggregation step. After consistent demonstration of its superiority over standard majority-vote preprocessing in multiple applications, STAPLE has become the go-to label fusion method in the creation of public medical image segmentation datasets, e.g., the ISLES Winzeck et al. (2018), MSSeg Commowick et al. (2018) and Gleason'19 (gle) datasets. Asman and Landman (2011) later extended this approach by accounting for voxel-wise consensus to address the issue of underestimation of annotators' reliability.
In Asman and Landman (2012), another extension was proposed to model the reliability of annotators across different pixels in images. More recently, within the context of multi-atlas segmentation problems Iglesias et al. (2013), where image registration is used to warp segments from labelled images ("atlases") onto a new scan, STAPLE has been enhanced in multiple ways to encode information from the underlying images into the label aggregation process. A notable example is STEPS, proposed by Cardoso et al. (2013), who designed a strategy to further incorporate the local morphological similarity between atlases and target images; different extensions of this approach have since been considered Asman and Landman (2013); Akhondi-Asl et al. (2014). However, these previous label fusion approaches share a common drawback: they critically lack a mechanism to integrate information across different training images. This fundamentally limits their applicability to cases where each image comes with a reasonable number of annotations from multiple experts, which can be prohibitively expensive in practice. Moreover, relatively simplistic functions are used to model the relationship between the observed noisy annotations, the true labels and the reliability of experts, which may fail to capture the complex characteristics of human annotators.
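To make the baseline concrete, the majority-vote fusion discussed above reduces to taking the per-pixel mode across annotators. A minimal NumPy sketch (the function name and tie-breaking rule are our own illustration, not taken from any cited toolkit):

```python
import numpy as np

def majority_vote(labels: np.ndarray) -> np.ndarray:
    """Fuse annotations by per-pixel majority vote.

    labels: integer array of shape (R, H, W), one segmentation mask per
    annotator. Returns the (H, W) mask of the most frequent label per pixel.
    Note: every annotator is implicitly treated as equally reliable, which
    is exactly the limitation that STAPLE-style methods address.
    """
    n_classes = labels.max() + 1
    # Count, per pixel, how many annotators voted for each class.
    votes = np.stack([(labels == k).sum(axis=0) for k in range(n_classes)])
    return votes.argmax(axis=0)  # ties broken towards the lower class id

# Three annotators label a 2x2 image; two of them agree everywhere.
ann = np.array([[[0, 1], [1, 1]],
                [[0, 1], [1, 0]],
                [[0, 1], [1, 1]]])
fused = majority_vote(ann)
```

Here the third annotator's disagreement at one pixel is simply outvoted; no estimate of annotator reliability is produced.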
In this work, we introduce the first instance of an end-to-end supervised segmentation method that jointly estimates, from noisy labels alone, the reliability of multiple human annotators and the true segmentation labels. The proposed architecture (Fig. 1) consists of two coupled CNNs, where one estimates the true segmentation probabilities and the other models the characteristics of individual annotators (e.g., tendency to over-segmentation, mix-up between different classes, etc.) by estimating the pixel-wise confusion matrices (CMs) on a per-image basis. Unlike STAPLE Warfield et al. (2004) and its variants, our method models, and disentangles with deep neural networks, the complex mappings from the input images to the annotator behaviours and to the true segmentation label. Furthermore, the parameters of the CNNs are "global variables" that are optimised across different image samples; this enables the model to disentangle robustly the annotators' mistakes and the true labels based on correlations between similar image samples, even when the number of available annotations per image is small (e.g., a single annotation per image). In contrast, this would not be possible with STAPLE Warfield et al. (2004) and its variants Asman and Landman (2012); Cardoso et al. (2013), where the annotators' parameters are estimated on every target image separately.

For evaluation, we first simulate a diverse range of annotator types on the MNIST dataset by performing morphometric operations with the Morpho-MNIST framework Castro et al. (2019). We then demonstrate the potential on several real-world medical imaging datasets, namely (i) the MS lesion segmentation dataset (MSLSC) from the ISBI 2015 challenge Styner et al. (2008), (ii) the brain tumour segmentation dataset (BraTS) Menze et al. (2014) and (iii) the lung nodule segmentation dataset (LIDC-IDRI) Armato III et al. (2011). Experiments on all datasets demonstrate that our method consistently leads to better segmentation performance compared to widely adopted label-fusion methods and other relevant baselines, especially when the number of available labels per image is low and the degree of annotator disagreement is high.
2 Related Work
The majority of algorithmic innovations in the space of label aggregation for segmentation have originated from the medical imaging community, partly due to the prominence of the inter-reader variability problem in the field and the wide-reaching value of reliable segmentation methods Asman and Landman (2012). The aforementioned methods based on the STAPLE framework, such as Warfield et al. (2004); Asman and Landman (2011, 2012); Cardoso et al. (2013); Weisenfeld and Warfield (2011); Asman and Landman (2013); Akhondi-Asl et al. (2014); Joskowicz et al. (2018), are based on generative models of human behaviours, where the latent variables of interest are the unobserved true labels and the "reliability" of the respective annotators. Our method can be viewed as a translation of the STAPLE framework to the supervised learning paradigm. As such, our method produces a model that can segment test images without needing to acquire labels from annotators or atlases, unlike STAPLE and its local variants. Another key difference is that our method is jointly trained on many different subjects, while the STAPLE variants are only fitted on a per-subject basis. This means that our method is able to learn from correlations between different subjects, which previous works have not attempted; for example, our method can uniquely estimate the reliability and true labels even when there is only one label available per input image, as shown later.
Our work also relates to a recent strand of methods that aim to generate a set of diverse and plausible segmentation proposals on a given image. Notably, the Probabilistic U-net Kohl et al. (2018) and its recent variant PHiSeg Baumgartner et al. (2019) have shown that the aforementioned inter-reader variations in segmentation labels can be modelled with sophisticated forms of probabilistic CNNs. Such approaches, however, fundamentally differ from ours in that the variable annotations from many experts in the training data are assumed to all be realistic instances of the true segmentation; we assume, on the other hand, that there is a single, unknown, true segmentation map of the underlying anatomy, and that each individual annotator produces a noisy approximation to it with variations that reflect their individual characteristics. The latter assumption may be reasonable in the context of segmentation problems, since there exists only one true boundary of the physical objects captured in an image, while multiple hypotheses can arise from ambiguities in human interpretation.
We also note that, in standard classification problems, a plethora of works have shown the utility of modelling the labelling process of human annotators in restoring the true label distribution Raykar et al. (2010); Khetan et al. (2017); Tanno et al. (2019). Such approaches can be categorised into two groups: (1) two-stage approaches Dawid and Skene (1979); Smyth et al. (1995); Whitehill et al. (2009); Welinder et al. (2010); Rodrigues et al. (2013), and (2) simultaneous approaches Raykar et al. (2009); Yan et al. (2010); Branson et al. (2017); Van Horn et al. (2018); Khetan et al. (2017); Tanno et al. (2019). In the first category, the noisy labels are first curated through a probabilistic model of annotators, and subsequently a supervised machine-learning model is trained on the curated labels. The initial attempt was made by Dawid and Skene (1979), and numerous advances Smyth et al. (1995); Whitehill et al. (2009); Welinder et al. (2010); Rodrigues et al. (2013) have since built upon this work, e.g., by estimating sample difficulty and human biases. In contrast, models in the second category aim to curate labels and learn a supervised model jointly in an end-to-end fashion Raykar et al. (2009); Yan et al. (2010); Branson et al. (2017); Van Horn et al. (2018); Khetan et al. (2017); Tanno et al. (2019), so that the two components inform each other. Although the evidence remains limited to simple classification tasks, these simultaneous approaches have shown promising improvements over methods in the first category in terms of both the predictive performance of the supervised model and sample efficiency (i.e., fewer labels are required per input). However, to date very little attention has been paid to the same problem in more complicated, structured prediction tasks where the outputs are high-dimensional.
In this work, we propose the first simultaneous approach to this problem for image segmentation, drawing inspiration from the STAPLE framework Warfield et al. (2004), which falls into the two-stage category.
3 Method
3.1 Problem Setup
In this work, we consider the problem of learning a supervised segmentation model from noisy labels acquired from multiple human annotators. Specifically, we consider a scenario where a set of images $\{\mathbf{x}_n \in \mathbb{R}^{W \times H \times C}\}_{n=1}^{N}$ (with $W, H, C$ denoting the width, height and channels of the image) are assigned noisy segmentation labels $\{\tilde{\mathbf{y}}_n^{(r)} \in \mathcal{Y}^{W \times H} : r \in S(\mathbf{x}_n)\}$ from multiple annotators, where $\tilde{\mathbf{y}}_n^{(r)}$ denotes the label from annotator $r$, $S(\mathbf{x}_n)$ denotes the set of all annotators who labelled image $\mathbf{x}_n$, and $\mathcal{Y} = \{1, \ldots, L\}$ denotes the set of classes.
Here we assume that every image is annotated by at least one person, i.e., $|S(\mathbf{x}_n)| \geq 1$, and that no GT labels are available. The problem of interest is to learn the unobserved true segmentation distribution $p(\mathbf{y} \mid \mathbf{x})$ from such a noisily labelled dataset, i.e., the combination of images, noisy annotations and experts' identities for labels (which label was obtained from whom).
We also emphasise that the goal at inference time is to segment a given unlabelled test image, not to fuse multiple available labels as is typically done in multi-atlas segmentation approaches Iglesias et al. (2013).
3.2 Probabilistic Model and Proposed Architecture
Here we describe the probabilistic model of the observed noisy labels from multiple annotators. We make two key assumptions: (1) annotators are statistically independent; (2) annotations over different pixels are independent given the input image. Under these assumptions, the probability of observing the noisy labels $\{\tilde{\mathbf{y}}^{(r)}\}_{r \in S(\mathbf{x})}$ on $\mathbf{x}$ factorises as:

$$p\big(\{\tilde{\mathbf{y}}^{(r)}\}_{r \in S(\mathbf{x})} \mid \mathbf{x}\big) = \prod_{r \in S(\mathbf{x})} \prod_{w=1}^{W} \prod_{h=1}^{H} p\big(\tilde{y}_{wh}^{(r)} \mid \mathbf{x}\big) \qquad (1)$$

where $\tilde{y}_{wh}^{(r)} \in \mathcal{Y}$ denotes the $(w,h)$ element of $\tilde{\mathbf{y}}^{(r)}$. Now we rewrite the probability of observing each noisy label at each pixel as:

$$p\big(\tilde{y}_{wh}^{(r)} \mid \mathbf{x}\big) = \sum_{k=1}^{L} p\big(\tilde{y}_{wh}^{(r)} \mid y_{wh} = k, \mathbf{x}\big) \cdot p\big(y_{wh} = k \mid \mathbf{x}\big) \qquad (2)$$

where $p(y_{wh} = k \mid \mathbf{x})$ denotes the GT label distribution over the $(w,h)$ pixel in image $\mathbf{x}$, and $p(\tilde{y}_{wh}^{(r)} = j \mid y_{wh} = k, \mathbf{x})$ describes the noisy labelling process by which annotator $r$ corrupts the true segmentation label. In particular, we refer to the $L \times L$ matrix $A^{(r)}_{wh}(\mathbf{x})$, whose $(j,k)$ element is defined by the second term, i.e., $a^{(r)}_{wh}(\mathbf{x})[j,k] := p(\tilde{y}_{wh}^{(r)} = j \mid y_{wh} = k, \mathbf{x})$, as the CM of annotator $r$ at pixel $(w,h)$ in image $\mathbf{x}$.
We introduce a CNN-based architecture which models the different constituents of the above joint probability distribution, as illustrated in Fig. 1. The model consists of two components: (1) Segmentation Network, parametrised by $\theta$, which estimates the GT segmentation probability map $\hat{\mathbf{p}}_\theta(\mathbf{x}) \in [0,1]^{W \times H \times L}$, whose elements approximate $p(y_{wh} = k \mid \mathbf{x})$; (2) Annotator Network, parametrised by $\phi$, which generates estimates of the pixel-wise CMs of the respective annotators as a function of the input image, $\hat{A}^{(r)}_\phi(\mathbf{x}) \in [0,1]^{W \times H \times L \times L}$, whose elements approximate $p(\tilde{y}_{wh}^{(r)} = j \mid y_{wh} = k, \mathbf{x})$. Each product $\hat{A}^{(r)}_\phi(\mathbf{x}) \ast \hat{\mathbf{p}}_\theta(\mathbf{x})$ represents the estimated segmentation probability map of the corresponding annotator. Note that here "$\ast$" denotes the element-wise matrix multiplication in the spatial dimensions $W, H$. At inference time, we use the output of the segmentation network $\hat{\mathbf{p}}_\theta(\mathbf{x})$ to segment test images.

We note that each spatial CM $\hat{A}^{(r)}_\phi(\mathbf{x})$ contains $WHL^2$ variables, and calculating the corresponding annotator's prediction requires $O(WHL^2)$ floating-point operations, potentially incurring a large time/space cost when the number of classes is large. Although this is not the focus of this work (we are concerned with medical imaging applications for which the number of classes is mostly limited to fewer than 10), we also consider a low-rank approximation scheme to alleviate this issue where appropriate. More details are provided in the supplementary material.
3.3 Learning Spatial Confusion Matrices and True Segmentation
Next, we describe how we jointly optimise the parameters of the segmentation network, $\theta$, and the parameters of the annotator network, $\phi$. In short, we minimise the negative log-likelihood of the probabilistic model plus a regularisation term via stochastic gradient descent. A detailed description is provided below.

Given training inputs $\{\mathbf{x}_n\}_{n=1}^N$ and noisy labels $\{\tilde{\mathbf{y}}_n^{(r)} : r \in S(\mathbf{x}_n)\}_{n=1}^N$, we optimise the parameters $\{\theta, \phi\}$ by minimising the negative log-likelihood (NLL), $-\sum_{n=1}^N \log p(\{\tilde{\mathbf{y}}_n^{(r)}\}_{r \in S(\mathbf{x}_n)} \mid \mathbf{x}_n)$. From eqs. (1) and (2), this optimisation objective equates to the sum of cross-entropy losses between the observed noisy segmentations and the estimated annotator label distributions:

$$-\sum_{n=1}^{N} \log p\big(\{\tilde{\mathbf{y}}_n^{(r)}\}_{r \in S(\mathbf{x}_n)} \mid \mathbf{x}_n\big) = \sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \mathrm{CE}\big(\hat{A}^{(r)}_\phi(\mathbf{x}_n) \ast \hat{\mathbf{p}}_\theta(\mathbf{x}_n),\, \tilde{\mathbf{y}}_n^{(r)}\big) \qquad (3)$$
Minimising the above encourages the annotator-specific predictions to be as close as possible to the true noisy label distributions of the annotators. However, this loss function alone is not capable of separating the annotation noise from the true label distribution; there are many pairs of CM estimates $\{\hat{A}^{(r)}_\phi(\mathbf{x})\}$ and segmentation model $\hat{\mathbf{p}}_\theta(\mathbf{x})$ whose product perfectly matches the true annotator's distribution for any input $\mathbf{x}$ (e.g., permutations of rows in the CMs). To combat this problem, inspired by Tanno et al. (2019), which addressed an analogous issue for the classification task, we add the trace of the estimated CMs to the loss function in Eq. (3) as a regularisation term (see Sec. 3.4). We thus optimise the combined loss:

$$\sum_{n=1}^{N} \sum_{r \in S(\mathbf{x}_n)} \Big[ \mathrm{CE}\big(\hat{A}^{(r)}_\phi(\mathbf{x}_n) \ast \hat{\mathbf{p}}_\theta(\mathbf{x}_n),\, \tilde{\mathbf{y}}_n^{(r)}\big) + \lambda \cdot \mathrm{tr}\big(\hat{A}^{(r)}_\phi(\mathbf{x}_n)\big) \Big] \qquad (4)$$

where $S(\mathbf{x})$ denotes the set of all annotators whose labels are available for image $\mathbf{x}$, $\mathrm{tr}(A)$ denotes the trace of matrix $A$ (averaged over pixels for the spatial CMs), and $\lambda$ is a hyperparameter. The mean trace represents the average probability that a randomly selected annotator provides an accurate label. Intuitively, minimising the trace encourages the estimated annotators to be maximally unreliable, while minimising the cross-entropy ensures fidelity with the observed noisy annotations. We minimise this combined loss via stochastic gradient descent to learn both $\{\theta, \phi\}$.
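A minimal NumPy sketch of the combined loss in Eq. (4) for a single image, under our illustrative shapes (the $\lambda$ value, function names and toy data are our assumptions, not taken from the paper):

```python
import numpy as np

def combined_loss(cms, p, noisy_labels, lam=0.5):
    """Cross-entropy to each annotator's noisy mask plus trace regulariser.

    cms:          (R, W, H, L, L) estimated per-annotator confusion matrices.
    p:            (W, H, L) estimated true segmentation probabilities.
    noisy_labels: (R, W, H) integer masks, one per annotator.
    Returns sum over annotators of CE(cm_r * p, y_r) + lam * mean trace(cm_r).
    """
    total = 0.0
    for cm, y in zip(cms, noisy_labels):
        # Annotator-specific prediction: pixel-wise matrix-vector product.
        q = np.einsum('whjk,whk->whj', cm, p)
        # Cross-entropy of the observed noisy labels under q, mean over pixels.
        w, h = np.indices(y.shape)
        ce = -np.log(q[w, h, y] + 1e-12).mean()
        # Mean trace of the pixel-wise CMs (the regularisation target).
        trace = np.trace(cm, axis1=-2, axis2=-1).mean()
        total += ce + lam * trace
    return total

# Identity CMs with a one-hot p reproduce the noisy labels exactly:
W, H, L = 2, 2, 2
p = np.eye(L)[np.array([[0, 1], [1, 1]])]   # one-hot (W, H, L)
cms = np.tile(np.eye(L), (1, W, H, 1, 1))   # one annotator, identity CMs
loss = combined_loss(cms, p, np.array([[[0, 1], [1, 1]]]))
# CE term is ~0 here, so the loss reduces to lam * trace = 0.5 * 2 = 1.0
```

In training, both `cms` and `p` would be network outputs and the gradient of this scalar would be backpropagated through both; the sketch only makes the arithmetic of Eq. (4) explicit.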
3.4 Justification for the Trace Norm
Here we provide a further justification for using the trace regularisation. Tanno et al. (2019) showed that if the average CM of the annotators is diagonally dominant, and the cross-entropy term in the loss function is zero, minimising the trace of the estimated CMs uniquely recovers the true CMs. However, their results concern properties of the average CMs of both the annotators and the classifier over the data population, rather than individual data samples. We show a similar but slightly weaker result in the sample-specific regime, which is more relevant as we estimate CMs of the respective annotators on every input image.
First, let us set up the notation. For brevity, for a given input image $\mathbf{x}$ and pixel $(w,h)$, we denote the true and estimated CMs of annotator $r$ by $A^{(r)}$ and $\hat{A}^{(r)}$. We also define the mean CM $A^* := \sum_r \pi_r A^{(r)}$ and its estimate $\hat{A}^* := \sum_r \pi_r \hat{A}^{(r)}$, where $\pi_r$ is the probability that annotator $r$ labels image $\mathbf{x}$. Lastly, as stated earlier, we assume there is a single GT segmentation label per image; thus the true $L$-dimensional probability vector at pixel $(w,h)$ takes the form of a one-hot vector, i.e., $\mathbf{p}(\mathbf{x}) = \mathbf{e}_k$ for, say, class $k$. Then the following result motivates the use of the trace regularisation:

Theorem 1.
If the annotators' segmentation probabilities are perfectly modelled for the given image, i.e., $\hat{A}^{(r)} \hat{\mathbf{p}}(\mathbf{x}) = A^{(r)} \mathbf{p}(\mathbf{x})$ for all $r$, and the mean true confusion matrix $A^*$ at a given pixel and its estimate $\hat{A}^*$ satisfy $a^*_{kk} > a^*_{kj}$ and $\hat{a}^*_{kk} > \hat{a}^*_{kj}$ for all $j$ such that $j \neq k$, then minimising the trace yields $\hat{A}^{(r)} = A^{(r)}$, and such solutions are unique in the $k$-th column, where $k$ is the correct pixel class.

The corresponding proof is provided in the supplementary material. The above result shows that if each estimated annotator's distribution is very close to the true noisy distribution (which is encouraged by minimising the cross-entropy loss), and, for a given pixel, the average CM has its $k$-th diagonal entry larger than any other entry in the same row (for the standard "majority vote" label to capture the correct true label, one instead requires the $k$-th diagonal element of the average CM to be larger than the sum of the remaining elements in the same row, which is a stricter condition), then minimising its trace will drive the estimates of the "correct class" columns in the respective annotators' CMs to match the true values. Although this result is weaker than that shown in Tanno et al. (2019) for the population setting rather than individual samples, the single-ground-truth assumption means that only the $k$-th columns of the CMs affect the observed label distributions, and thus it suffices to recover the column of the correct class.
To encourage the estimated CMs $\hat{A}^{(r)}_\phi(\mathbf{x})$ to also be diagonally dominant, we initialise them to identity matrices by training the annotator network to maximise the trace for a sufficient number of iterations as a warm-up period. Intuitively, the combination of the trace term and the cross-entropy separates the true distribution from the annotation noise by finding the maximal amount of confusion which still explains the noisy observations well.
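The role of the trace term can be illustrated on a single binary pixel. Both an unreliable annotator observing a one-hot GT and a perfectly reliable annotator observing an ambiguous GT can explain the same noisy label frequency, but the trace singles out the former. The following is a toy construction of ours, not an example from the paper:

```python
import numpy as np

# Observed noisy label frequency at one binary pixel.
obs = np.array([0.7, 0.3])

# Solution A (the intended disentanglement): one-hot ground truth,
# noisy annotator; columns of the CM are p(noisy label | true = k).
p_a = np.array([1.0, 0.0])
cm_a = np.array([[0.7, 0.4],
                 [0.3, 0.6]])

# Solution B (degenerate): ambiguous ground truth, perfect annotator.
p_b = np.array([0.7, 0.3])
cm_b = np.eye(2)

# Both pairs reproduce the observation exactly...
assert np.allclose(cm_a @ p_a, obs) and np.allclose(cm_b @ p_b, obs)

# ...but the trace regulariser prefers the maximally unreliable CM of A.
assert np.trace(cm_a) < np.trace(cm_b)   # 1.3 < 2.0
```

This is exactly the non-identifiability that the cross-entropy term alone cannot resolve, and why the trace is minimised only after the warm-up has made the CM estimates diagonally dominant.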
4 Experiments
We evaluate our method on a variety of datasets, covering both synthetic and real-world scenarios: 1) for MNIST segmentation and the ISBI 2015 MS lesion segmentation challenge dataset Jesson and Arbel (2015), we apply morphological operations to generate synthetic noisy labels for binary segmentation tasks; 2) for the BraTS 2019 dataset Menze et al. (2014), we apply similar simulations to create noisy labels for a multi-class segmentation task; 3) we also consider the LIDC-IDRI dataset, which contains multiple annotations per input acquired from different clinical experts, as the evaluation in practice. The details of the noisy label simulation can be found in Appendix A.1.
Our experiments are based on the assumption that no ground-truth (GT) label is known a priori; hence, we compare our method against multiple label fusion methods. In particular, we consider four label fusion baselines: a) the mean of all of the noisy labels; b) the mode label, obtained by taking the "majority vote"; c) label fusion via the original STAPLE method Warfield et al. (2004); d) Spatial STAPLE, a more recent extension of c) that accounts for spatial variations in CMs. After curating the noisy annotations via the above methods, we train the segmentation network on the fused labels and report the results. For c) and d), we used the publicly available toolkit (https://www.nitrc.org/projects/masifusion/). In addition, we also include a recent method called Probabilistic U-net as another baseline, which has been shown to capture inter-reader variations accurately. The details are presented in Appendix A.2.
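For reference, baseline c) can be sketched in its original binary form: an EM loop alternating between per-pixel foreground posteriors and each rater's sensitivity/specificity. This is a compact re-derivation of ours following Warfield et al. (2004), not the toolkit implementation used in our experiments:

```python
import numpy as np

def staple_binary(D, n_iter=50, prior=0.5):
    """Binary STAPLE via EM.

    D: (R, N) array of 0/1 decisions from R raters over N pixels.
    Returns (W, p, q): posterior foreground probability per pixel,
    and per-rater sensitivity p and specificity q.
    """
    R, N = D.shape
    p = np.full(R, 0.9)   # initial sensitivities
    q = np.full(R, 0.9)   # initial specificities
    for _ in range(n_iter):
        # E-step: posterior that each pixel is truly foreground.
        a = prior * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(D == 0, q[:, None], 1 - q[:, None]), axis=0)
        W = a / (a + b)
        # M-step: re-estimate each rater's reliability parameters.
        p = (D * W).sum(axis=1) / W.sum()
        q = ((1 - D) * (1 - W)).sum(axis=1) / (1 - W).sum()
    return W, p, q

# Two accurate raters and one rater who over-segments the background.
D = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0]])
W, p, q = staple_binary(D)
```

The third rater's extra foreground votes are explained away as low specificity rather than changing the consensus. Note that these parameters are fitted per subject, in contrast to our jointly trained annotator network.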
For evaluation metrics, we use: 1) root-MSE between the estimated CMs and the real CMs; 2) the Dice coefficient (DICE) between the estimated segmentation and the true segmentation; 3) the generalized energy distance proposed in Kohl et al. (2018) to measure the quality of the estimated annotators' labels.

4.1 MNIST and MS lesion segmentation datasets
The MNIST dataset consists of 60,000 training and 10,000 testing examples, all of which are 28 × 28 grayscale images of digits from 0 to 9; we derive the segmentation labels by thresholding the intensity values at 0.5. The MS dataset is publicly available and comprises 21 3D scans from 5 subjects. All scans are split into 10 for training and 11 for testing. We hold out 20% of the training images as a validation set for both datasets. On both datasets, our proposed model achieves a higher Dice similarity coefficient than STAPLE in the dense label case and, even more prominently, in the single label (i.e., 1 label per image) case (shown in Tables 1 & 2 and Fig. 2). In addition, our model, both with and without the trace norm, outperforms STAPLE in terms of CM estimation. Additionally, we report the performance for different regularisation coefficients in Fig. 2(a). Fig. 2(b) compares the segmentation accuracy on MNIST and MS lesion for a range of average annotator Dice scores, where the labels are generated by a group of 5 simulated annotators. Fig. 3 illustrates that our model can capture the patterns of mistakes of each annotator.
Table 1: Dense labels (labels from all annotators available per image): Dice score (%) and CM estimation error (root-MSE) on MNIST and MS-Lesion.

| Models | MNIST DICE (%) | MNIST CM estimation | MS-Lesion DICE (%) | MS-Lesion CM estimation |
| Naive CNN on mean labels | 38.36 ± 0.41 | n/a | 46.55 ± 0.53 | n/a |
| Naive CNN on mode labels | 62.89 ± 0.63 | n/a | 47.82 ± 0.76 | n/a |
| Probabilistic U-net Kohl et al. (2018) | 65.12 ± 0.83 | n/a | 46.15 ± 0.59 | n/a |
| Separate CNNs on annotators | 70.44 ± 0.65 | n/a | 46.84 ± 1.24 | n/a |
| STAPLE Warfield et al. (2004) | 78.03 ± 0.29 | 0.1241 ± 0.0011 | 55.05 ± 0.53 | 0.1502 ± 0.0026 |
| Spatial STAPLE Asman and Landman (2012) | 78.96 ± 0.22 | 0.1195 ± 0.0013 | 58.37 ± 0.47 | 0.1483 ± 0.0031 |
| Ours without Trace | 79.63 ± 0.53 | 0.1125 ± 0.0037 | 65.77 ± 0.62 | 0.1342 ± 0.0053 |
| Ours | 82.92 ± 0.19 | 0.0893 ± 0.0009 | 67.55 ± 0.31 | 0.0811 ± 0.0024 |
| Oracle (Ours but with known CMs) | 83.29 ± 0.11 | 0.0238 ± 0.0005 | 78.86 ± 0.14 | 0.0415 ± 0.0017 |
Table 2: Single label (one label per image): Dice score (%) and CM estimation error (root-MSE) on MNIST and MS-Lesion.

| Models | MNIST DICE (%) | MNIST CM estimation | MS-Lesion DICE (%) | MS-Lesion CM estimation |
| Naive CNN | 32.79 ± 1.13 | n/a | 27.41 ± 1.45 | n/a |
| STAPLE Warfield et al. (2004) | 54.07 ± 0.68 | 0.2617 ± 0.0064 | 35.74 ± 0.84 | 0.2833 ± 0.0081 |
| Spatial STAPLE Asman and Landman (2012) | 56.73 ± 0.53 | 0.2384 ± 0.0061 | 38.21 ± 0.71 | 0.2591 ± 0.0074 |
| Ours without Trace | 74.48 ± 0.37 | 0.1538 ± 0.0029 | 54.76 ± 0.66 | 0.1745 ± 0.0044 |
| Ours | 76.48 ± 0.25 | 0.1329 ± 0.0012 | 56.43 ± 0.47 | 0.1542 ± 0.0023 |
Table 3: Generalized energy distance (lower is better) between model samples and annotator labels.

| Models | MNIST | MS | BraTS | LIDC-IDRI |
| Probabilistic U-net Kohl et al. (2018) | 1.46 ± 0.04 | 1.91 ± 0.03 | 3.23 ± 0.07 | 1.97 ± 0.03 |
| Ours | 1.24 ± 0.02 | 1.67 ± 0.03 | 3.14 ± 0.05 | 1.87 ± 0.04 |
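The generalized energy distance reported above can be computed with $d(a, b) = 1 - \mathrm{IoU}(a, b)$ as the distance between segmentations; the following is our own sketch of the metric from Kohl et al. (2018) for binary masks:

```python
import numpy as np

def iou_distance(a, b):
    """d(a, b) = 1 - IoU between two binary masks; 0 if both are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 0.0 if union == 0 else 1.0 - inter / union

def generalized_energy_distance(samples, labels):
    """GED^2 = 2 E[d(s, y)] - E[d(s, s')] - E[d(y, y')].

    samples: list of binary masks drawn from the model.
    labels:  list of binary masks from the annotators.
    """
    cross = np.mean([iou_distance(s, y) for s in samples for y in labels])
    within_s = np.mean([iou_distance(s, t) for s in samples for t in samples])
    within_y = np.mean([iou_distance(y, z) for y in labels for z in labels])
    return 2 * cross - within_s - within_y

# If the model samples reproduce the annotator masks, the distance is zero.
masks = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]])]
ged = generalized_energy_distance(masks, masks)
```

The metric is low when the model's sample diversity matches the annotators' diversity, not merely when the mean prediction is accurate.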
4.2 BraTS Dataset and LIDC-IDRI Dataset
We also evaluate our model on a multi-class segmentation task, using all 259 high-grade glioma (HGG) cases in the training data of the 2019 multimodal Brain Tumour Segmentation Challenge (BraTS). We extract each slice as a 2D image and split the slices case-wise to obtain 1,600 images for training, 300 for validation and 500 for testing. Preprocessing includes: concatenation of all available modalities; centre cropping to 192 × 192; and normalisation per case and per modality. To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations on the provided GT mask to create 4 synthetic noisy labels with different patterns, namely over-segmentation, under-segmentation, wrong segmentation and good segmentation. The details of the noisy label simulation are in Appendix A.3.
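The over- and under-segmentation patterns mentioned above can be simulated with simple morphological operations. A pure-NumPy sketch (the 3 × 3 structuring element and the specific operations are our assumptions; the exact simulation protocol is described in the appendix):

```python
import numpy as np

def dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation with a 3x3 structuring element (zero-padded borders)."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy:1 + dy + mask.shape[0],
                          1 + dx:1 + dx + mask.shape[1]]
    return out

def erode(mask: np.ndarray) -> np.ndarray:
    """Binary erosion: complement of the dilated complement."""
    return 1 - dilate(1 - mask)

# A simulated "over-segmenting" annotator dilates the GT mask,
# an "under-segmenting" one erodes it.
gt = np.zeros((7, 7), dtype=int)
gt[2:5, 2:5] = 1       # 3x3 foreground block
over = dilate(gt)      # grows to a 5x5 block
under = erode(gt)      # shrinks to the single centre pixel
```

Repeated application, or combination with random offsets, yields annotators with progressively stronger and more varied biases.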
Lastly, we use the LIDC-IDRI dataset to evaluate our method in the scenario where multiple labels are acquired from different clinical experts. The dataset contains 1018 lung CT scans from 1010 patients with manual lesion segmentations from four experts. For each scan, 4 radiologists provided annotation masks for lesions that they independently detected and considered to be abnormal. For our experiments, we use the same method as in Kohl et al. (2018) to preprocess all scans. We split the dataset case-wise into training (722 patients), validation (144 patients) and testing (144 patients) sets. We then resampled the CT scans to a common in-plane resolution and centre-cropped 2D images around the lesion positions, in order to focus on the annotated lesions. The lesion positions are those where at least one of the experts segmented a lesion. This yields 5,000 images in the training set, 1,000 images in the validation set and 1,000 images in the test set. Since the dataset does not provide a single curated ground truth for each image, we created a "gold standard" by aggregating the labels via Spatial STAPLE Asman and Landman (2012), a recent variant of the STAPLE framework employed in the creation of public medical image segmentation datasets, e.g., the ISLES Winzeck et al. (2018), MSSeg Commowick et al. (2018) and Gleason'19 (gle) datasets. We further note that, as before, we assume labels are only available to the model during training, but not at test time; thus label aggregation methods cannot be applied to the test examples.
On both the BraTS and LIDC-IDRI datasets, our proposed model achieves a higher Dice similarity coefficient than STAPLE and Spatial STAPLE in both the dense label and single label scenarios (shown in Tables 4 and 5 in Appendix A.3). In addition, our model (with trace) outperforms STAPLE in terms of CM estimation by a large margin on BraTS. In Fig. 3(a), we visualise the segmentation results on BraTS and the corresponding annotators' predictions. Fig. 3(b) presents three examples of the segmentation results and the corresponding four annotator contours, as well as the consensus. As shown in both figures, our model successfully predicts both the segmentation of lesions and the variations of each annotator in different cases. We also measure the inter-reader consensus levels by computing the IoU of the multiple annotations, and compare the segmentation performance in three subgroups of different consensus levels (low, medium and high). Results are shown in Figs. 10 and 11 in Appendix A.3.
Additionally, as shown in Table 3, our model consistently outperforms the Probabilistic U-net on generalized energy distance across the four different test datasets, indicating that our method can better capture the inter-annotator variations than this baseline. This result shows that the information about which labels were acquired from whom is useful in modelling the variability in the observed segmentation labels.
5 Conclusion
We introduced the first CNN-based learning method for simultaneously recovering the label noise of multiple annotators and the GT label distribution for supervised segmentation problems. We demonstrated this method on real-world datasets with both synthetic and real annotations. Our method is capable of modelling the characteristics of individual annotators and thereby improving robustness against label noise. Experiments have shown that our model achieves considerable improvements over the traditional label fusion approaches, including averaging, the majority vote, and the widely used STAPLE framework and its spatially varying variants, in terms of both segmentation accuracy and the quality of CM estimation.
In the future, we plan to accommodate meta-information about annotators (e.g., number of years of experience) and non-image data (e.g., genetics) that may influence the pattern of the underlying segmentation label, such as lesion appearance, in our framework. We are also interested in assessing the utility of our approach in downstream applications. Of particular interest is the design of active data collection schemes where the segmentation model is used to select which samples to annotate ("active learning"), and the annotator models are used to decide which experts should label them ("active labelling") Yan et al. (2010). Another exciting avenue of application is the education of inexperienced annotators: the estimated spatial characteristics of segmentation mistakes provide further insights into their annotation behaviours, which they may benefit from in improving their annotation quality.

Acknowledgement
We would like to thank Swami Sankaranarayanan and Ardavan Saeedi at Butterfly Network for their feedback and initial discussions. Mou-Cheng is supported by GSK funding (BIDS3000034123) via the UCL EPSRC CDT in i4health and the UCL Engineering Dean's Prize. We are also grateful for EPSRC grants EP/R006032/1, EP/M020533/1, CRUK/EPSRC grant NS/A000069/1, and the NIHR UCLH Biomedical Research Centre, which support this work.
References

Lazarus et al. [2006] Elizabeth Lazarus, Martha B Mainiero, Barbara Schepps, Susan L Koelliker, and Linda S Livingston. BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology, 239(2):385–391, 2006.
Watadani et al. [2013] Takeyuki Watadani, Fumikazu Sakai, Takeshi Johkoh, Satoshi Noma, Masanori Akira, Kiminori Fujimoto, Alexander A Bankier, Kyung Soo Lee, Nestor L Müller, Jae-Woo Song, et al. Interobserver variability in the CT assessment of honeycombing in the lungs. Radiology, 266(3):936–944, 2013.
Rosenkrantz et al. [2013] Andrew B Rosenkrantz, Ruth P Lim, Mershad Haghighi, Molly B Somberg, James S Babb, and Samir S Taneja. Comparison of interreader reproducibility of the Prostate Imaging Reporting and Data System and Likert scales for evaluation of multiparametric prostate MRI. American Journal of Roentgenology, 201(4):W612–W618, 2013.
Menze et al. [2014] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10):1993–2024, 2014.
Joskowicz et al. [2019] Leo Joskowicz, D Cohen, N Caplan, and J Sosna. Inter-observer variability of manual contour delineation of structures in CT. European Radiology, 29(3):1391–1399, 2019.
Zhang et al. [2019] Huahong Zhang, Alessandra M Valcarcel, Rohit Bakshi, Renxin Chu, Francesca Bagnato, Russell T Shinohara, Kilian Hett, and Ipek Oguz. Multiple sclerosis lesion segmentation with Tiramisu and 2.5D stacked slices. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 338–346. Springer, 2019.
Kats et al. [2019] Eytan Kats, Jacob Goldberger, and Hayit Greenspan. A soft STAPLE algorithm combined with anatomical knowledge. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 510–517. Springer, 2019.
 Harvey and Glocker [2019] Hugh Harvey and Ben Glocker. A standardised approach for preparing imaging data for machine learning tasks in radiology. In Artificial Intelligence in Medical Imaging, pages 61–72. Springer, 2019.
 Warfield et al. [2004] Simon K Warfield, Kelly H Zou, and William M Wells. Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE transactions on medical imaging, 23(7):903–921, 2004.
 Winzeck et al. [2018] Stefan Winzeck, Arsany Hakim, Richard McKinley, Jos AADSR Pinto, Victor Alves, Carlos Silva, Maxim Pisov, Egor Krivov, Mikhail Belyaev, Miguel Monteiro, et al. Isles 2016 and 2017benchmarking ischemic stroke lesion outcome prediction based on multispectral mri. Frontiers in neurology, 9:679, 2018.
 Commowick et al. [2018] Olivier Commowick, Audrey Istace, Michael Kain, Baptiste Laurent, Florent Leray, Mathieu Simon, Sorina Camarasu Pop, Pascal Girard, Roxana Ameli, JeanChristophe Ferré, et al. Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Scientific reports, 8(1):1–17, 2018.
 Gleason 2019 challenge. https://gleason2019.grand-challenge.org/Home/. Accessed: 2020-02-30.
 Asman and Landman [2011] Andrew J Asman and Bennett A Landman. Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (collate). IEEE transactions on medical imaging, 30(10):1779–1794, 2011.
 Asman and Landman [2012] Andrew J Asman and Bennett A Landman. Formulating spatially varying performance in the statistical fusion framework. IEEE transactions on medical imaging, 31(6):1326–1336, 2012.
 Iglesias et al. [2013] Juan Eugenio Iglesias, Mert Rory Sabuncu, and Koen Van Leemput. A unified framework for cross-modality multi-atlas segmentation of brain mri. Medical image analysis, 17(8):1181–1191, 2013.
 Cardoso et al. [2013] M Jorge Cardoso, Kelvin Leung, Marc Modat, Shiva Keihaninejad, David Cash, Josephine Barnes, Nick C Fox, Sebastien Ourselin, Alzheimer’s Disease Neuroimaging Initiative, et al. Steps: Similarity and truth estimation for propagated segmentations and its application to hippocampal segmentation and brain parcelation. Medical image analysis, 17(6):671–684, 2013.
 Asman and Landman [2013] Andrew J Asman and Bennett A Landman. Non-local statistical label fusion for multi-atlas segmentation. Medical image analysis, 17(2):194–208, 2013.
 Akhondi-Asl et al. [2014] Alireza Akhondi-Asl, Lennox Hoyte, Mark E Lockhart, and Simon K Warfield. A logarithmic opinion pool based staple algorithm for the fusion of segmentations with associated reliability weights. IEEE transactions on medical imaging, 33(10):1997–2009, 2014.
 Castro et al. [2019] Daniel C. Castro, Jeremy Tan, Bernhard Kainz, Ender Konukoglu, and Ben Glocker. Morpho-MNIST: Quantitative assessment and diagnostics for representation learning. Journal of Machine Learning Research, 20, 2019.
 Styner et al. [2008] Martin Styner, Joohwi Lee, Brian Chin, M Chin, Olivier Commowick, H Tran, S Markovic-Plese, V Jewells, and S Warfield. 3d segmentation in the clinic: A grand challenge ii: Ms lesion segmentation. Midas Journal, 2008:1–6, 2008.
 Armato III et al. [2011] Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38(2):915–931, 2011.
 Weisenfeld and Warfield [2011] Neil I Weisenfeld and Simon K Warfield. Learning likelihoods for labeling (l3): a general multi-classifier segmentation algorithm. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 322–329. Springer, 2011.
 Joskowicz et al. [2018] Leo Joskowicz, D Cohen, N Caplan, and Jacob Sosna. Automatic segmentation variability estimation with segmentation priors. Medical image analysis, 50:54–64, 2018.
 Kohl et al. [2018] Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R Ledsam, Klaus Maier-Hein, SM Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A probabilistic u-net for segmentation of ambiguous images. In Advances in Neural Information Processing Systems, pages 6965–6975, 2018.
 Baumgartner et al. [2019] Christian F Baumgartner, Kerem C Tezcan, Krishna Chaitanya, Andreas M Hötker, Urs J Muehlematter, Khoschy Schawkat, Anton S Becker, Olivio Donati, and Ender Konukoglu. Phiseg: Capturing uncertainty in medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 119–127. Springer, 2019.
 Raykar et al. [2010] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
 Khetan et al. [2017] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
 Tanno et al. [2019] Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. arXiv preprint arXiv:1902.03680, 2019.
 Dawid and Skene [1979] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pages 20–28, 1979.
 Smyth et al. [1995] Padhraic Smyth, Usama M Fayyad, Michael C Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of venus images. In Advances in neural information processing systems, pages 1085–1092, 1995.
 Whitehill et al. [2009] Jacob Whitehill, Tingfan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
 Welinder et al. [2010] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.
 Rodrigues et al. [2013] Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Learning from multiple annotators: distinguishing good from random labelers. Pattern Recognition Letters, 34(12):1428–1436, 2013.
 Raykar et al. [2009] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th Annual international conference on machine learning, pages 889–896. ACM, 2009.
 Yan et al. [2010] Yan Yan, Rómer Rosales, Glenn Fung, Mark Schmidt, Gerardo Hermosillo, Luca Bogoni, Linda Moy, and Jennifer Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In AISTATs, pages 932–939, 2010.

 Branson et al. [2017] Steve Branson, Grant Van Horn, and Pietro Perona. Lean crowdsourcing: Combining humans and machines in an online system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7474–7483, 2017.
 Van Horn et al. [2018] Grant Van Horn, Steve Branson, Scott Loarie, Serge Belongie, Cornell Tech, and Pietro Perona. Lean multiclass crowdsourcing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2714–2723, 2018.

 Jesson and Arbel [2015] Andrew Jesson and Tal Arbel. Hierarchical mrf and random forest segmentation of ms lesions and healthy tissues in brain mri. Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, pages 1–2, 2015.
 Chandra et al. [2017] Siddhartha Chandra, Nicolas Usunier, and Iasonas Kokkinos. Dense and low-rank gaussian crfs using deep embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 5103–5112, 2017.
 Fey and Lenssen [2019] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
 Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Sukhbaatar et al. [2014] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
Appendix A Additional results
A.1 Annotation Simulation Details
We generate synthetic annotations from an assumed GT on the MNIST, MS lesion and BraTS datasets, to evaluate the efficacy of the approach in an idealised situation where the GT is known. We simulate a group of 5 annotators of disparate characteristics by performing morphological transformations (e.g., thinning, thickening, fractures) on the ground-truth (GT) segmentation labels, using the Morpho-MNIST software Castro et al. [2019]. In particular, the first annotator provides a faithful segmentation ("good-segmentation") that approximates the GT, the second tends to over-segment ("over-segmentation"), the third tends to under-segment ("under-segmentation"), the fourth is prone to a combination of small fractures and over-segmentation ("wrong-segmentation") and the fifth always annotates everything as the background ("blank-segmentation"). To create synthetic noisy labels in the multi-class scenario, we first choose a target class and then apply morphological operations to the provided GT mask to create 4 synthetic noisy labels with different patterns, namely, over-segmentation, under-segmentation, wrong segmentation and good segmentation. We create training data by deriving labels from the simulated annotators. We also experimented with varying the levels of the morphological operations on the MNIST and MS lesion datasets, to test the robustness of our methods to varying degrees of annotation noise.
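The five annotator types above can be illustrated with simple morphological operations. The following is a minimal NumPy sketch, not the Morpho-MNIST pipeline itself; the 4-neighbourhood structuring element, the iteration counts and the fracture rate are illustrative assumptions:

```python
import numpy as np

def dilate(mask, iterations=1):
    # 4-neighbourhood binary dilation via array shifts (wraps at the borders)
    m = mask.astype(bool)
    for _ in range(iterations):
        m = (m | np.roll(m, 1, 0) | np.roll(m, -1, 0)
               | np.roll(m, 1, 1) | np.roll(m, -1, 1))
    return m

def erode(mask, iterations=1):
    # 4-neighbourhood binary erosion via array shifts (wraps at the borders)
    m = mask.astype(bool)
    for _ in range(iterations):
        m = (m & np.roll(m, 1, 0) & np.roll(m, -1, 0)
               & np.roll(m, 1, 1) & np.roll(m, -1, 1))
    return m

def simulate_annotators(gt, seed=0):
    """Derive the five synthetic annotations from a binary GT mask."""
    g = gt.astype(bool)
    rng = np.random.default_rng(seed)
    fractures = rng.random(g.shape) > 0.1          # random holes
    return {
        "good":  g,                                # faithful annotator
        "over":  dilate(g, 2),                     # tends to over-segment
        "under": erode(g, 2),                      # tends to under-segment
        "wrong": dilate(g, 2) & fractures,         # fractures + over-segmentation
        "blank": np.zeros_like(g),                 # everything is background
    }
```

Applying `simulate_annotators` to each GT mask yields one noisy label map per simulated annotator; the iteration count controls the severity of the corruption.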
A.2 Additional Qualitative Results on the MNIST and MS Datasets
Here we provide additional qualitative comparisons of segmentation results and CM visualisations on the MNIST and MS datasets. We examine the ability of our method to learn the CMs of the annotators and the true label distribution when only a single label per image is available. Fig. 3 and Fig. 5 show the segmentation results on the MNIST dataset with a single label per image. Our model achieved a higher dice similarity coefficient than STAPLE and Spatial STAPLE; more prominently, our model outperformed STAPLE and Spatial STAPLE, both with and without the trace norm, in terms of CM estimation. Fig. 4 and Fig. 6 illustrate that, even with a single label per image, our model can still capture the patterns of mistakes of the annotators.
A.3 Quantitative and Extra Qualitative Results on BraTS and LIDC-IDRI
Here we provide the quantitative comparison of our method and other baselines on the BraTS and LIDC-IDRI datasets, which were omitted from the main text due to the space limit (see Table 4 and Table 5). We also provide additional qualitative examples (see Figs. 7, 8 and 9) on both datasets. Lastly, we compare the segmentation performance on 3 subgroups of LIDC-IDRI with varying levels of inter-reader variability; Fig. 11 illustrates that our method attains consistent improvement over the baselines in all cases, indicating its ability to segment robustly even the hard examples on which the experts in reality disagreed to a large extent.
BraTS 2019 is a multi-class segmentation dataset, containing 259 cases with high-grade (HG) and 76 cases with low-grade (LG) glioma (a type of brain tumour). For each case, four MRI modalities are available: FLAIR, T1, T1-contrast and T2. The datasets are preprocessed by the organisers: co-registered to the same anatomical template, interpolated to the same resolution (1 mm³) and skull-stripped. We centre-cropped 2D images (192×192 pixels) and hold out 1600 2D images for training, 300 images for validation and 500 images for testing. We apply Gaussian normalisation to each modality of each case, to obtain zero mean and unit variance. Fig. 7 shows another tumour case in the four different modalities with different target labels. We also present several example results of the different methods in Fig. 8.

To demonstrate the performance on a dataset with real-world annotations, we have also evaluated our model on LIDC-IDRI. The "ground truth" labels in these experiments are generated by aggregating the multiple labels via Spatial STAPLE Asman and Landman [2012], as used in the curation of existing public datasets e.g., ISLES Winzeck et al. [2018], MSSeg Commowick et al. [2018] and Gleason'19 gle. Fig. 9 presents several examples of segmentation results from the different methods. We also measure the inter-reader consensus level by computing the IoU of the annotations, and compare in Fig. 10 the estimates from our model against the values measured on the real annotations. Furthermore, we divide the test dataset, whose consensus values range from 30% to 90%, into low-consensus (30% to 65%), middle-consensus (65% to 75%) and high-consensus (75% to 90%) subgroups, and compare the dice coefficients of our model and the baselines in Fig. 11. Our method shows a competitive ability to segment the challenging examples with low consensus values.
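The cropping and normalisation steps described above can be sketched as follows; the array layout (one 2D slice per modality) and the epsilon guard are illustrative assumptions:

```python
import numpy as np

def preprocess_slice(img, size=192):
    """Centre-crop a 2D slice to size x size, then z-score normalise it.

    `img` is one modality of one case; Gaussian normalisation is applied
    per case and per modality, giving zero mean and unit variance.
    """
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    crop = img[top:top + size, left:left + size].astype(float)
    # guard against constant (e.g. all-background) slices
    return (crop - crop.mean()) / (crop.std() + 1e-8)
```

Each modality of each case is passed through this routine independently before being stacked into the network input.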
On both the BraTS and LIDC-IDRI datasets, our proposed model consistently achieves a higher dice similarity coefficient than STAPLE in both the dense-labels and single-label scenarios (shown in Table 4 and Table 5). In addition, our model (with trace) outperforms STAPLE in terms of CM estimation by a large margin on BraTS. In Fig. 7, we visualise the segmentation results on BraTS and the corresponding annotators' predictions. Fig. 8 presents four examples of the segmentation results and the corresponding annotators' predictions, as well as those of the baseline methods. As shown in both figures, our model successfully predicts both the segmentation of the lesions and the variations of each annotator in the different cases.
Table 4: Dice similarity coefficient (%) and CM estimation error (mean ± standard deviation) on BraTS and LIDC-IDRI in the dense-labels scenario.
| Models | BraTS DICE (%) | BraTS CM estimation | LIDC-IDRI DICE (%) | LIDC-IDRI CM estimation |
| Naive CNN on mean labels | 29.42 ± 0.58 | n/a | 56.72 ± 0.61 | n/a |
| Naive CNN on mode labels | 34.12 ± 0.45 | n/a | 58.64 ± 0.47 | n/a |
| Probabilistic U-net Kohl et al. [2018] | 40.53 ± 0.75 | n/a | 61.26 ± 0.69 | n/a |
| STAPLE Warfield et al. [2004] | 46.73 ± 0.17 | 0.2147 ± 0.0103 | 69.34 ± 0.58 | 0.0832 ± 0.0043 |
| Spatial STAPLE Asman and Landman [2012] | 47.31 ± 0.21 | 0.1871 ± 0.0094 | 70.92 ± 0.18 | 0.0746 ± 0.0057 |
| Ours without Trace | 49.03 ± 0.34 | 0.1569 ± 0.0072 | 71.25 ± 0.12 | 0.0482 ± 0.0038 |
| Ours | 53.47 ± 0.24 | 0.1185 ± 0.0056 | 74.12 ± 0.19 | 0.0451 ± 0.0025 |
| Oracle (Ours but with known CMs) | 67.13 ± 0.14 | 0.0843 ± 0.0029 | 79.41 ± 0.17 | 0.0381 ± 0.0021 |
Table 5: Dice similarity coefficient (%) and CM estimation error (mean ± standard deviation) on BraTS and LIDC-IDRI in the single-label (one label per image) scenario.
| Models | BraTS DICE (%) | BraTS CM estimation | LIDC-IDRI DICE (%) | LIDC-IDRI CM estimation |
| Naive CNN on mean & mode labels | 36.12 ± 0.93 | n/a | 48.36 ± 0.79 | n/a |
| STAPLE Warfield et al. [2004] | 38.74 ± 0.85 | 0.2956 ± 0.1047 | 57.32 ± 0.87 | 0.1715 ± 0.0134 |
| Spatial STAPLE Asman and Landman [2012] | 41.59 ± 0.74 | 0.2543 ± 0.0867 | 62.35 ± 0.64 | 0.1419 ± 0.0207 |
| Ours without Trace | 43.74 ± 0.49 | 0.1825 ± 0.0724 | 66.95 ± 0.51 | 0.0921 ± 0.0167 |
| Ours | 46.21 ± 0.28 | 0.1576 ± 0.0487 | 68.12 ± 0.48 | 0.0587 ± 0.0098 |
A.4 Low-rank Approximation
Here we show our preliminary results on the employed low-rank approximation of the confusion matrices for the BraTS dataset, omitted from the main text. Table 6 compares the performance of our method with the default implementation and with the rank-1 approximation. We see that the low-rank approximation can halve the number of parameters in the CMs and the number of floating-point operations (FLOPs) in computing the annotator predictions while reasonably retaining the performance on both segmentation and CM estimation. We note, however, that the practical gain of this approximation in this task is limited since the number of classes is only 4, as indicated by the marginal reduction in the overall GPU usage for one example. We expect the gain to increase when the number of classes is larger, as shown in Fig. 12.
Table 6: Effect of the rank-1 approximation of the CMs on BraTS.
| Rank | Dice | CM estimation | GPU Memory | No. Parameters | FLOPs |
| Default | 53.47 ± 0.24 | 0.1185 ± 0.0056 | 2.68 GB | 589824 | 1032192 |
| Rank-1 | 50.56 ± 2.00 | 0.1925 ± 0.0314 | 2.57 GB | 294912 | 405504 |
Lastly, we also describe the details of the devised low-rank approximation. Analogous to the work of Chandra et al. [2017], where a similar approximation was employed for estimating the pairwise terms in a densely connected CRF, we parametrise the spatial CM as a product B(x)C(x)ᵀ of two smaller rectangular matrices B(x) and C(x) of size L×r, where r < L. In this case, the annotator network outputs B(x) and C(x) for each annotator in lieu of the full CM. Two separate rectangular matrices are used here since the confusion matrices are not necessarily symmetric. Such a low-rank approximation reduces the total number of variables per pixel to 2rL from L² and the number of floating-point operations (FLOPs) to O(rL) from O(L²). Fig. 12 shows that the time and space complexity of the default method grow quadratically in the number of classes while the low-rank approximations grow linearly.
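A minimal NumPy sketch of this factorisation (the class count L = 4 of BraTS and rank r = 1, for a single pixel, both purely illustrative) confirms the parameter saving and shows that the annotator distribution can be computed without ever forming the full L×L matrix:

```python
import numpy as np

L, r = 4, 1                        # number of classes, approximation rank
rng = np.random.default_rng(0)

# the annotator network would output B and C (each L x r) per pixel
B = rng.random((L, r))
C = rng.random((L, r))
p_hat = np.full(L, 1 / L)          # estimated true-label distribution

# full-CM route: L*L parameters, O(L^2) FLOPs per pixel
A = B @ C.T
q_full = A @ p_hat

# low-rank route: 2*L*r parameters, O(r*L) FLOPs per pixel
q_lowrank = B @ (C.T @ p_hat)      # never materialises the L x L matrix

assert np.allclose(q_full, q_lowrank)
assert B.size + C.size == 2 * r * L    # vs L * L for the full CM
```

With L = 4 and r = 1 this gives 8 parameters per pixel instead of 16, matching the halving of CM parameters reported in Table 6.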
Appendix B Implementation details
Our method is implemented in PyTorch 1.0 Fey and Lenssen [2019]. Our network is based on a 2D U-net Ronneberger et al. [2015] with 4 downsampling stages; the channel numbers of the encoder stages are 32, 64, 128 and 256, and we replaced the batch normalisation layers with instance normalisation. Our segmentation network and annotator network share the same parameters apart from the last layer in the decoder of the U-net; essentially, the overall architecture is implemented as a U-net with multiple output layers: one for the prediction of the true segmentation, and the others for the predictions of the noisy segmentations of the respective annotators. For the segmentation network, the output of the last layer has c channels, where c is the number of classes. For the annotator network, by default, the output of the last layer has c² channels for estimating a confusion matrix at each spatial location; when the low-rank approximation of rank r is used, the output of the last layer has 2rc channels. The Probabilistic U-net implementation is adopted from https://github.com/stefanknegt/ProbabilisticUnetPytorch; for a fair comparison, we adjusted the number of channels and the depth of the U-net backbone in the Probabilistic U-net to match our networks. All of the models were trained on an NVIDIA RTX 2080 at least 3 times with different random initialisations to compute the mean performance and its standard deviation. The Adam Kingma and Ba [2014] optimiser was used in all experiments with the default hyperparameter settings. We provide all of the hyperparameters of the experiments for each dataset in Table 7. We kept the training details the same between the baselines and our method.

Table 7: Hyperparameters used for each dataset.
| Data set | Learning Rate | Epochs | Batch Size | Augmentation | Weight for regularisation |
| MNIST | 1e-4 | 60 | 2 | Random flip | 0.7 |
| MS | 1e-4 | 55 | 2 | Random flip | 0.7 |
| BraTS | 1e-4 | 60 | 8 | Random flip | 1.5 |
| LIDC | 1e-4 | 75 | 4 | Random flip | 0.9 |
B.1 PyTorch implementation of the loss function
The following is the PyTorch implementation of the loss function in eq. (4). We also intend to clean up the whole codebase and release it with the final version.
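A framework-agnostic NumPy sketch of this loss is given below for clarity; the argument shapes, the einsum formulation and the name `alpha` are illustrative assumptions rather than the exact released code:

```python
import numpy as np

def noisy_label_loss(p_hat, cms, noisy_labels, alpha=0.7):
    """Trace-regularised loss over all annotators (cf. eq. (4)).

    p_hat:        (c, H, W) softmax output of the segmentation head
    cms:          per-annotator (c, c, H, W) pixel-wise confusion matrices
    noisy_labels: per-annotator (H, W) integer label maps
    alpha:        weight of the trace regulariser
    """
    c, H, W = p_hat.shape
    ce, trace = 0.0, 0.0
    for A, y in zip(cms, noisy_labels):
        # predicted annotator distribution A(x) p(x) at every pixel
        q = np.einsum('ijhw,jhw->ihw', A, p_hat)
        # cross-entropy between q and this annotator's noisy labels
        q_flat = q.reshape(c, -1)
        ce += -np.log(q_flat[y.reshape(-1), np.arange(H * W)] + 1e-12).mean()
        # trace of the CM, averaged over all pixels
        trace += np.einsum('iihw->hw', A).mean()
    return ce + alpha * trace
```

Minimising the cross-entropy term fits the annotator distributions, while the trace term drives the estimated CMs towards the minimal-trace solution motivated by Theorem 1.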
Appendix C Proof of Theorem 1
We first show a specific case of Theorem 1 when there is only a single annotator, and subsequently extend it to the scenario with multiple annotators. Without loss of generality, we show the result for an arbitrary choice of a pixel in a given input image x. Specifically, let us denote the estimated confusion matrix (CM) of the annotator at the pixel by Â(x), and suppose the true class of this pixel is k i.e., p(x) = e_k where e_k denotes the k-th elementary basis vector. Let p̂(x) denote the L-dimensional estimated label distribution at the corresponding pixel (instead of over the whole image).
Lemma 1.
If the annotator's segmentation probability is fully captured by the model for the pixel in image x i.e., Â(x)p̂(x) = A(x)p(x), and both Â(x), A(x) satisfy that a_ii > a_ij for i = 1, …, L and for all j such that j ≠ i, then tr(Â(x)) is minimised when p̂(x) = p(x). Furthermore, if tr(Â(x)) = tr(A(x)), then the true label is fully recovered i.e., p̂(x) = p(x) = e_k, and the k-th columns of Â(x) and A(x) are the same.
Proof.
We first show that the diagonal element a_kk in A is smaller than or equal to its estimate â_kk in Â. Since p(x) is a one-hot vector, Â(x)p̂(x) = A(x)p(x) = A(x)e_k holds and â_kj ≤ â_kk for all j, and it follows that:

a_kk = Σ_j â_kj p̂_j   (5)
     ≤ â_kk Σ_j p̂_j = â_kk.   (6)

The possibility of equality in the above comes from the fact that all entries in p̂ except the k-th element could be zeros. Now, the assumption that there is a single ground truth label for the pixel means that all the values of the true CM A are uniformly equal to 1/L except in the k-th column. In addition, since the diagonal dominance of the estimated CM means each â_ii is at least 1/L, we have that

tr(A) = a_kk + (L − 1)/L ≤ â_kk + Σ_{i≠k} â_ii = tr(Â).

It therefore follows that when p̂ = e_k holds, the trace of Â is the smallest. Now, we show that when this holds i.e., tr(Â) = tr(A), then the k-th columns of the two matrices match up.
By way of contradiction, let us assume that there exists a class j ≠ k for which the estimated label probability is non-zero i.e., p̂_j > 0. This implies that p̂_k < 1. From eq. (6), if the traces of A and Â are the same, then â_kk = a_kk also holds and thus we have â_kk = Σ_j â_kj p̂_j. By rearranging this equality and dividing both sides by Σ_{j≠k} p̂_j = 1 − p̂_k > 0, we obtain â_kk = Σ_{j≠k} â_kj p̂_j / Σ_{j≠k} p̂_j. Now, as we have â_kj < â_kk for all j ≠ k, it follows that

â_kk = Σ_{j≠k} â_kj p̂_j / Σ_{j≠k} p̂_j < â_kk,

which is false. Therefore, the trace equality implies p̂ = e_k, and thus from Â(x)p̂(x) = A(x)e_k, we conclude that the k-th columns of Â and A are the same.
∎
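The key inequality (6) can be sanity-checked numerically. In the NumPy snippet below, the matrix is an arbitrary column-normalised CM whose diagonal dominates each row, and the class index is chosen for illustration:

```python
import numpy as np

L, k = 4, 2  # number of classes and the (arbitrary) true class of the pixel

# column-normalised estimated CM with dominant diagonal entries
A_hat = np.full((L, L), 0.05)
np.fill_diagonal(A_hat, 1.0)
A_hat = A_hat / A_hat.sum(axis=0)

e_k = np.eye(L)[k]
candidates = [np.full(L, 1 / L), np.array([0.5, 0.2, 0.2, 0.1]), e_k]

# inequality (6): the k-th entry of A_hat @ p_hat never exceeds a_kk ...
for p_hat in candidates:
    assert (A_hat @ p_hat)[k] <= A_hat[k, k] + 1e-12

# ... and equality holds exactly when p_hat puts all its mass on class k
assert np.isclose((A_hat @ e_k)[k], A_hat[k, k])
assert all((A_hat @ p)[k] < A_hat[k, k] for p in candidates[:2])
```

Any mass that p̂ places outside the true class strictly reduces the k-th entry of Âp̂, which is exactly the slack the trace regulariser exploits.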
We note that the equivalent result for the expectation of the annotator’s CM over the data population was provided in Sukhbaatar et al. [2014] and Tanno et al. [2019]. The main difference is, as described in the main text, that we show a slightly weaker version of their result in a samplespecific scenario.
Now, we show that the main theorem follows naturally from the above lemma. As a reminder, we restate the theorem below.
Theorem 1. For the pixel in a given image x, we define the mean confusion matrix (CM) A(x) := Σ_r π_r(x)A^(r)(x) and its estimate Â(x) := Σ_r π_r(x)Â^(r)(x), where π_r(x) is the probability that the r-th annotator labels image x. If the annotators' segmentation probabilities are perfectly modelled for the given image i.e., Â^(r)(x)p̂(x) = A^(r)(x)p(x) for every annotator r, and the average true confusion matrix A(x) at a given pixel and its estimate Â(x) satisfy that a_ii > a_ij for i = 1, …, L and for all j such that j ≠ i, then minimising tr(Â(x)) yields p̂(x) = p(x), and such solutions are unique in the k-th columns of the CMs, where k is the correct pixel class.
Proof.
A direct application of Lemma 1 shows firstly that tr(Â^(r)(x)) is minimised when p̂(x) = p(x) for all annotators r (since that ensures Â^(r)(x)p̂(x) = A^(r)(x)p(x) for each r). Secondly, it implies that minimising tr(Â(x)) = Σ_r π_r(x)tr(Â^(r)(x)) yields p̂(x) = p(x). Because we assume that the annotators' noisy labels are correctly modelled i.e., Â^(r)(x)p̂(x) = A^(r)(x)p(x), it therefore follows that the k-th columns of Â(x) and A(x) are the same.
∎