Inter-rater variability limits the achievable performance of deep learning segmentation by introducing human error into the ground truth (Carass et al., 2017). Tasks such as multiple sclerosis (MS) lesion segmentation are highly challenging due to the small size of lesions and their poorly defined borders, leading to low inter-rater agreement and high deep learning model uncertainty (Gros et al., 2019; Nair et al., 2020). For instance, some experts tend to over-segment while others under-segment, yielding “confusion” for a segmentation model trained on data labelled by different raters (Jungo et al., 2018). Understanding rater style could allow for better model performance, e.g., by integrating this knowledge within the deep learning training scheme.
2 Related Works
Previous studies have shown that rater style can be learned (Shwartzman et al., 2019), and therefore that inter-rater disagreement patterns could potentially also be learned by the model (Chotzoglou and Kainz, 2019). There has also been work on jointly learning individual rater characteristics at the same time as the “true” consensus segmentation, in classification (Zhang et al., 2020), segmentation (Tanno et al., 2019), and object detection (Sudre et al., 2019). It has also been shown that the method used to generate the ground truth from multiple rater annotations, e.g., label fusion (Jungo et al., 2018) or label sampling (Jensen et al., 2019), largely impacts model uncertainty.
While many studies have addressed the uncertainty introduced by multiple raters, fewer have addressed the uncertainty introduced by a single rater. A model trained with data from a single rater will still exhibit some level of uncertainty due to rater style, and our goal is therefore to find which factors in a rater’s style generate uncertainty. Those factors could include a tendency to under- or over-segment, consistency across images, and non-independence of raters (e.g., the influence of the expert who trained the rater). Intuitively, a non-biased and highly consistent rater would be the ideal candidate for training a deep learning model. We therefore expect a correlation between a rater’s bias/consistency and the uncertainty of the model trained with their annotations. This would mean that a characterization of the rater’s bias could eventually be incorporated as prior knowledge within the learning scheme or in the reporting of uncertainty (post-processing).
3 Material and Methods
Two public MRI datasets with multi-rater annotations were used. The first is a brain multiple sclerosis (MS) lesion dataset introduced at a MICCAI 2016 challenge (Commowick et al., 2018). It consists of 15 subjects, each annotated by seven raters from three different centers. The second is a spinal cord (SC) gray matter (GM) dataset introduced at a segmentation challenge (Prados et al., 2017), which includes 40 subjects with annotations from four raters (each rater from a different center).
In this paper, we characterize a rater’s style using rater bias and consistency. Since the consensus of all raters is the closest we have to the real ground truth, we define a rater’s bias to be the average difference (in terms of positive voxel count) between the rater’s annotation and the consensus across all volumes:
$$\mathrm{bias}_r = \frac{1}{N}\sum_{i=1}^{N}\left(V(S_{r,i}) - V(C_i)\right) \tag{1}$$
with $V(\cdot)$ the number of positive voxels in a segmentation mask (i.e., voxels belonging to the target segmentation class), $S_{r,i}$ the annotation of rater $r$ for image $i$, $C_i$ the consensus for image $i$, and $N$ the number of images. Consensus is defined by majority voting, as explained in section 3.3. A positive or a negative bias therefore measures whether a given rater has a tendency to over- or under-segment, respectively. Images refer to 3D volumes, but using 2D slices as a basis instead would give the same results up to a constant factor, since bias is an average.
Similarly, we define rater consistency as the standard deviation of the difference (in terms of positive voxel count) between the rater’s annotation and the consensus across all volumes:
$$\mathrm{consistency}_r = \operatorname{std}_{i=1,\dots,N}\left(V(S_{r,i}) - V(C_i)\right) \tag{2}$$
Consistency therefore measures whether a rater is always either over-segmenting or under-segmenting (consistent: values close to zero) or doing a bit of both (inconsistent: higher values).
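As a minimal sketch of the two definitions above, the bias and consistency of one rater can be computed from per-volume positive voxel counts. The function names and the data layout (one flattened binary mask per volume) are illustrative assumptions, as is the use of the population standard deviation; none of them comes from the original pipeline.

```python
from statistics import mean, pstdev


def positive_voxels(mask):
    """Count the positive (non-zero) voxels of a flattened binary mask."""
    return sum(1 for v in mask if v)


def rater_style(rater_masks, consensus_masks):
    """Return (bias, consistency) of one rater relative to the consensus.

    Both arguments are parallel lists holding one flattened binary mask
    per volume (hypothetical layout chosen for this sketch).
    """
    diffs = [positive_voxels(r) - positive_voxels(c)
             for r, c in zip(rater_masks, consensus_masks)]
    bias = mean(diffs)           # > 0: over-segmentation, < 0: under-segmentation
    consistency = pstdev(diffs)  # population std (assumption); 0 = same offset everywhere
    return bias, consistency


# Toy data: the rater adds exactly one voxel on each of two volumes.
consensus = [[1, 1, 0, 0], [1, 0, 0, 0]]
rater = [[1, 1, 1, 0], [1, 1, 0, 0]]
bias, consistency = rater_style(rater, consensus)
```

On this toy input the rater over-segments by one voxel on every volume, so the bias is positive while the consistency value is zero.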
We chose to use an absolute bias metric as opposed to a relative one since we think not all slices deserve the same weight. For example, it would be unfair to penalize a rater by the same amount for a 10% error on a slice showing only a single 10-voxel lesion, versus for a 10% error on a slice with multiple lesions totaling hundreds of voxels. The error in the former case is likely negligible, whereas the error in the latter case is large and systematic (multiple lesions), but both would have the same impact on the computed bias if we had used a relative metric. We nevertheless also considered relative metrics, obtained by normalizing the difference used in bias and consistency by the number of positive voxels in the consensus in each image. Results of these investigations are presented in Appendix A and show that the bias/uncertainty relationship in Figures 2 and 3 still holds when using relative bias.
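The weighting argument above can be illustrated with a small worked example. The voxel counts below are hypothetical and chosen only to mirror the two slices described in the text (a single 10-voxel lesion versus several lesions totaling 300 voxels), each with a 10% over-segmentation.

```python
# Hypothetical positive voxel counts for two slices (not data from the study).
consensus_voxels = [10, 300]   # single small lesion vs. several lesions
rater_voxels = [11, 330]       # 10% over-segmentation on both slices

# Absolute differences: the large slice dominates.
abs_diffs = [r - c for r, c in zip(rater_voxels, consensus_voxels)]

# Relative differences: both slices contribute equally.
rel_diffs = [(r - c) / c for r, c in zip(rater_voxels, consensus_voxels)]

print(abs_diffs)  # [1, 30]
print(rel_diffs)  # [0.1, 0.1]
```

With the absolute metric the second slice contributes thirty times more to the bias than the first, whereas the relative metric weights both errors identically.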
Images were resampled and cropped (with separate settings for MS brain and for SC GM) before being fed to the models. Data augmentation (rotation, translation, scaling) was applied slice-wise. Datasets were split randomly into training, validation and testing sets. 2D U-Nets (Ronneberger et al., 2015) were trained slice-wise with the annotations of each individual rater. While it is no longer state of the art, a 2D U-Net is sufficient since it can achieve performance close to inter-rater variability levels (Gros et al., 2019; Vincent et al., 2020). Additional performance would not be beneficial since the main goal is to study uncertainty and not segmentation performance. A more advanced architecture would probably only result in overfitting on some rater styles. Training was done on NVIDIA P100 GPUs using the open-source framework ivadomed (http://ivadomed.org/) (Gros et al., 2021), which is based on PyTorch (Paszke et al., 2019). Configuration files containing all hyperparameters for both datasets are also available at https://github.com/olix86/paper_rater_uncertainty. Models were trained using a Dice loss (Milletari et al., 2016). Inference was then done on the test set to measure each model’s performance (Dice score) and aleatoric uncertainty (Wang et al., 2019). Uncertainty is estimated using test-time data augmentation (rotation, translation, scaling), and is computed as the entropy of 10 Monte Carlo samples for each image. The exact settings for the transforms are described in the configuration files linked above. Aleatoric uncertainty was chosen because it is considered representative of the “inherent” uncertainty in the data, whereas epistemic uncertainty is considered dependent on the model parameters (i.e., it could go away with more data) (Kendall and Gal, 2017; Kiureghian and Ditlevsen, 2009). All the previous steps (pre-processing, data augmentation, training, evaluation and uncertainty computation) were done with ivadomed.
Preliminary experiments on the MS brain dataset showed that generating the ground truth with STAPLE (Warfield et al., 2004) yielded similar results in terms of rater style (bias and consistency) compared to majority voting. The only difference was a constant offset applied to the bias of all raters, meaning that majority voting has a tendency to over-segment when compared to STAPLE. Since this affects all raters and has no impact when comparing styles between raters, ground truths were generated using majority voting, which is easier to interpret (i.e., consensus voxel = 1 if at least 50% of raters voted 1). By default, the term “consensus” refers to this combination of all raters for a given dataset; however, single-center consensuses were also computed using the same method and are compared to the global consensus.
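The majority-voting rule and the entropy-based uncertainty described above can be sketched as follows. This is an illustration under stated assumptions, not the ivadomed implementation: the function names are hypothetical, masks are flattened binary lists, the Monte Carlo predictions are assumed to be already mapped back to the original image space, and the entropy is taken voxel-wise on the mean prediction, which is one common reading of "entropy of the Monte Carlo samples".

```python
import math


def majority_vote(rater_masks):
    """Consensus mask: a voxel is 1 if at least 50% of raters voted 1."""
    n_raters = len(rater_masks)
    return [1 if 2 * sum(votes) >= n_raters else 0
            for votes in zip(*rater_masks)]


def voxelwise_entropy(mc_samples):
    """Binary entropy (in nats) of the mean prediction over Monte Carlo samples.

    mc_samples holds one flattened binary prediction per augmented
    forward pass (assumed layout).
    """
    n = len(mc_samples)
    entropies = []
    for votes in zip(*mc_samples):
        p = sum(votes) / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # all samples agree: no uncertainty
        else:
            entropies.append(-(p * math.log(p) + (1 - p) * math.log(1 - p)))
    return entropies


# Four raters, three voxels: the tie on the second voxel counts as positive.
consensus = majority_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]])

# Two Monte Carlo samples that disagree only on the second voxel.
uncertainty = voxelwise_entropy([[1, 1, 0], [1, 0, 0]])
```

Voxels on which all samples agree get zero entropy, while a voxel predicted positive in half of the samples reaches the maximum value of ln 2.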
4.1 Rater style
We first examine rater style in the form of bias and consistency relative to consensus for MS brain. Styles are shown in Figure 1 and solely depend on the ground truths from each rater; they do not involve any deep learning model.
We notice three clusters which are clearly delimited by the center to which raters belong. This implies that rater style depends much more on the rater’s center than on the rater’s individual characteristics. Indeed, cluster radii are much smaller than the distances between pairs of cluster centroids. To assess the quality of the clustering we use the Davies-Bouldin index (Davies and Bouldin, 1979), a metric which quantifies the quality of clustering through ratios of intra-cluster scatter to inter-cluster distance (lower is better). Here, the index indicates that intra-cluster scatter is considerably lower than inter-cluster distances. Our hypothesis is that uncertainty for individual raters should follow a similar center-centric pattern, assuming it depends on rater style. This does not apply to the GM dataset since it contains only a single rater per center.
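For reference, the Davies-Bouldin index can be computed in a few lines. The sketch below assumes each rater is a point in the (bias, consistency) plane and each center is a cluster; the function names and the toy coordinates are illustrative, not values from the study.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a list of clusters (lower is better).

    Each cluster is a list of points, e.g. (bias, consistency) pairs.
    """
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance of the cluster's points to its centroid.
    scatters = [sum(math.dist(p, m) for p in c) / len(c)
                for c, m in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i with any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k


# Toy example: two tight clusters that are far apart.
toy_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
db_index = davies_bouldin(toy_clusters)
```

A cluster containing a single rater has zero scatter, so it contributes to the index only through the scatter of the cluster it is compared with.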
In both datasets, raters with a higher bias also have higher uncertainty. Over-segmentation (bias > 0) seems to be associated with higher uncertainty than under-segmentation (bias < 0). Raters are also clustered by center for the MS brain dataset, but in this case the distance between clusters is smaller than in the rater-style graph, since other factors also influence uncertainty, such as noise in the data and the limited size of the training set.
It is interesting to note that while a higher rater bias produces higher uncertainty, it does not affect model performance as assessed by the Dice score, as shown in Figure 4.
All raters exhibit some level of bias and as we saw earlier, bias is correlated with uncertainty (Figures 2-3). We now investigate whether combining raters through consensus would lower uncertainty when compared to single-rater training. Results of this investigation are shown in Figure 5, highlighting a consensus uncertainty 30% lower than the average across individual raters.
Center-wise consensuses do not, however, exhibit the same characteristic, as they have higher uncertainty than the global (multi-center) consensus and are comparable (lower for center 2, slightly higher for center 1, and irrelevant for the single rater of center 3) to the average uncertainty of their raters used individually (Figure 6).
Finally, the previous results are also reflected in the performance (Dice score) of the models, with the global consensus scoring above the average of individual raters and above the single-center consensuses, as shown in Table 1. Thus, it seems that combining raters from different centers has a more positive impact on uncertainty and Dice than combining raters from the same center.
Table 1: Dice score for consensus
|Consensus|Dice score|
|Center 1 consensus|0.47|
|Center 2 consensus|0.46|
|Center 3 consensus|0.43|
5.1 Key takeaways
This study shows that rater style can be characterised by measuring rater consistency and bias. Moreover, results from the brain MS dataset suggest that rater style is mostly center-specific rather than rater-specific. Results on both MS brain and SC GM also show that when using annotations from a single rater to train a deep learning model, a high rater bias leads to high model uncertainty. This is interesting since these models are trained on annotations from a single rater and have therefore never “seen” the consensus, yet bias relative to consensus still impacts uncertainty. While rating style impacts the amount of uncertainty, bias does not directly affect the average performance (Dice score) of the model, meaning that the rater style can be learned by the model regardless of uncertainty. A mechanism that could potentially explain why over-segmentation leads to higher uncertainty is the partial volume effect. Indeed, a rater who under-segments (e.g., labelling only voxels that contain 100% lesion tissue, and not those at the boundary that contain some other tissue) would give an easier task to the model: voxels labelled as lesions are homogeneous and simple to identify. Conversely, a rater who over-segments also includes voxels containing a varying percentage of lesion tissue, which is potentially harder since there is less homogeneity, therefore yielding more uncertainty.
Another interesting result is that uncertainty was lower for the global consensus model (i.e., when fusing all raters’ annotations into a single binary annotation) than for models trained using annotations from a single rater. We hypothesize that this phenomenon originates from the biases of individual raters, which get smoothed away when combining raters from different centers that have different styles. This is also probably why combining raters’ annotations from a single center (center-wise consensuses) does not reduce a model’s uncertainty: individual biases cannot cancel out since we combine raters with similar styles and shortcomings. A single rater, such as the one from Center #3, can therefore have lower uncertainty than the consensus from the four raters of Center #1, due to the higher bias of the latter. Multi-center consensus could therefore be a mechanism to lower the impact of rater style.
This lower uncertainty for the global consensus, however, opens up questions regarding the impact of inter-rater variability on uncertainty. Inter-rater variability is by definition not present for single-rater models, but is present in center-wise consensuses, and is at its highest for the global consensus since it combines raters with diverging styles. Our results therefore suggest that the reduction in rater bias when going from a single rater to the global consensus has a bigger impact on uncertainty than the addition of inter-rater variability. It is therefore possible that inter-rater variability is indeed present but relatively constant throughout the dataset, thus not generating much uncertainty. A limitation of this study is that the number of raters (7 for the MS dataset and 4 for the SC GM dataset) is relatively small; our results would therefore benefit from further validation on datasets with larger pools of raters from different centers.
5.2 Impact and perspectives
Raters used in the MS brain study were junior raters trained by senior raters from their center (Commowick et al., 2018), therefore the mutual influence between raters during the learning and segmentation process probably drives the similarities in rating style. This center-wise rater style pattern raises a few questions concerning label fusion, which is largely used in deep learning medical imaging studies. Indeed, in the case of the MS brain dataset, since the split between centers is 4-2-1, if one uses a majority voting consensus it essentially becomes the vote of the four raters from a single center, negating the benefits of having two additional centers with raters in the study. It is doubtful that STAPLE and its variants could really solve the issue since it is based on majority voting, only with weights updated iteratively. If the four raters from one center dominate during the first iteration, the remaining raters will see their weighting be progressively reduced until convergence.
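The dominance of the largest center under plain majority voting can be checked by enumeration. The sketch below assumes the 4-2-1 split described above and the "at least 50%" rule, and considers the case where the four raters of the largest center agree with each other; the variable names are illustrative.

```python
from itertools import product


def majority(votes):
    """1 if at least 50% of the votes are 1, else 0."""
    return 1 if 2 * sum(votes) >= len(votes) else 0


# Center 1 contributes four raters; centers 2 and 3 contribute the other three.
outcomes = {}
for center1_vote in (0, 1):
    # Try every possible combination of votes from the three remaining raters.
    results = {majority([center1_vote] * 4 + list(others))
               for others in product((0, 1), repeat=3)}
    outcomes[center1_vote] = results

print(outcomes)  # {0: {0}, 1: {1}}
```

Whenever the four raters of the largest center are unanimous, the three remaining raters cannot change the consensus for that voxel, whatever they vote.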
Future studies should therefore consider whether raters from the same center can really be considered independent, or whether voting should be weighted by center instead of by rater. Rater weights could also be considered as hyperparameters that can be optimised in order to minimize uncertainty. Alternatively, rater weights could be incorporated into the input ground truth segmentation using a “soft training” pipeline (Gros et al., 2020). While our rater style was defined using simple metrics independent of deep learning models, it would be interesting to see whether learned rater style approaches (Zhang et al., 2020; Tanno et al., 2019; Sudre et al., 2019) show a similar relationship to uncertainty.
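One possible form of center-weighted voting is sketched below. It is only an illustration of the idea raised above, not a method evaluated in this study: each rater receives a weight equal to one over the number of raters in their center, so that every center carries the same total weight. The function name and the threshold rule are assumptions.

```python
from collections import Counter


def center_weighted_vote(votes, centers):
    """Consensus for one voxel where every center has the same total weight.

    votes[i] is the binary label given by rater i and centers[i] is the
    center that rater belongs to. Each rater is weighted by
    1 / (number of raters in the same center).
    """
    raters_per_center = Counter(centers)
    n_centers = len(raters_per_center)
    score = sum(v / raters_per_center[c] for v, c in zip(votes, centers))
    # Positive if centers representing at least 50% of the weight voted 1.
    return 1 if 2 * score >= n_centers else 0


centers = [1, 1, 1, 1, 2, 2, 3]   # 4-2-1 split, as in the MS brain dataset
votes = [1, 1, 1, 1, 0, 0, 0]     # only the raters of center 1 label the voxel

weighted = center_weighted_vote(votes, centers)
plain = 1 if 2 * sum(votes) >= len(votes) else 0
```

In this example plain majority voting labels the voxel as positive, while the center-weighted vote does not, since only one center out of three supports it.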
Other potentially interesting metrics include measuring boundary difference instead of volume difference. An example would be the average symmetric surface distance (ASSD), which computes the average Euclidean distance between the object boundaries across raters. This metric would be particularly relevant for the MS lesion task, where there is a large heterogeneity of object shape, and it could therefore be interesting to complement the volume difference analysis with some shape analysis. Indeed, an increase of lesion radius (e.g., evenly adding 1 voxel along the lesion boundary) would have a different impact on the relative increase of the lesion volume depending on whether the lesion is small or large (e.g., a 10-voxel vs. a 100-voxel lesion). Therefore, from a “radius segmentation style” perspective, it could be said that our absolute metric over-weights large lesions at the expense of small ones, whereas the opposite would hold for our relative metric. However, measures based on boundaries also have drawbacks. It is possible that two raters segment the same lesion volume with a slightly different boundary (translation, change of shape, etc.). Such a metric would therefore measure a bias even though there is none (there is indeed a disagreement, but not in the form of the over- or under-segmentation that we are looking for). To summarize, the main drawback of our volumetric bias is the possibility that it turns out to be non-linearly dependent on some other underlying bias (e.g., if it in fact depends on the radius). While we present two bias metrics here, a detailed comparison with other relevant metrics would be interesting to explore in future research.
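A minimal 2D sketch of the ASSD is given below to make the boundary-based alternative concrete. It assumes non-empty binary masks given as nested lists, isotropic pixels (distances in pixel units), and a 4-neighbor definition of the boundary; these choices and the function names are illustrative assumptions.

```python
import math


def boundary(mask):
    """Foreground pixels of a 2D binary mask that touch the outside of the object."""
    rows, cols = len(mask), len(mask[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # A pixel is on the boundary if a 4-neighbor is background
            # or falls outside the image.
            if any(not (0 <= nr < rows and 0 <= nc < cols) or not mask[nr][nc]
                   for nr, nc in neighbors):
                points.append((r, c))
    return points


def assd(mask_a, mask_b):
    """Average symmetric surface distance between two non-empty 2D masks."""
    a, b = boundary(mask_a), boundary(mask_b)
    a_to_b = [min(math.dist(p, q) for q in b) for p in a]
    b_to_a = [min(math.dist(q, p) for p in a) for q in b]
    return (sum(a_to_b) + sum(b_to_a)) / (len(a) + len(b))


# Two annotations with the same volume (2 pixels), shifted by one pixel.
rater_a = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
rater_b = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
distance = assd(rater_a, rater_b)
```

The toy masks reproduce the drawback discussed above: the volume difference between the two raters is zero, yet the boundary-based metric reports a non-zero disagreement.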
Finally, uncertainty could have potential applications for quality control, such as identifying biased raters when few ratings are available. As an example, a rater generating significantly higher than expected uncertainty for a given task could be excluded as an outlier. Conversely, rating style could be used as a pre-processing step to “correct” biases on an individual rater basis, or incorporated as a prior in future deep learning architectures. Model segmentation could be modulated using metrics about the rater style (e.g., used as inputs for FiLM layers (Perez et al., 2017; Lemay et al., 2021)).
The authors would like to thank Andréanne Lemay and Lucas Rouhier from the IVADO medical imaging team for helpful discussions. Funded by the Institut de valorisation des données (IVADO), the Canada Research Chair in Quantitative Magnetic Resonance Imaging [950-230815], the Canadian Institute of Health Research [CIHR FDN-143263], the Canada Foundation for Innovation [32454, 34824], the Fonds de Recherche du Québec - Santé, the Natural Sciences and Engineering Research Council of Canada [RGPIN-2019-07244], the FRQNT Strategic Clusters Program (2020-RS4-265502 - Centre UNIQUE - Union Neurosciences & Artificial Intelligence - Quebec), and the Canada First Research Excellence Fund through the TransMedTech Institute. C.G. has a fellowship from IVADOMED [EX-2018-4]; O.V. has a fellowship from NSERC, FRQNT and UNIQUE.
The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.
We declare we don’t have conflicts of interest.
Appendix A. Relative bias
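In this appendix we present the equivalent of Figures 2 and 3 using relative instead of absolute bias. Relative bias is defined in equation 3 in a similar way as absolute bias in equation 1, the only change being that we normalize the difference between rater and consensus by the number of positive voxels in the consensus in each image, therefore ensuring no volume has a disproportionate weight:
$$\mathrm{bias}^{\mathrm{rel}}_r = \frac{1}{N}\sum_{i=1}^{N}\frac{V(S_{r,i}) - V(C_i)}{V(C_i)} \tag{3}$$
where $V(\cdot)$ is the number of positive voxels in a segmentation mask, $S_{r,i}$ the annotation of rater $r$ for image $i$, $C_i$ the consensus for image $i$, and $N$ the number of images.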
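The relative bias can be sketched in the same way as the absolute one. The function name is illustrative, the inputs are per-image positive voxel counts, and skipping images with an empty consensus is an assumption made here to avoid a division by zero; the source does not state how such images are handled.

```python
from statistics import mean


def relative_bias(rater_counts, consensus_counts):
    """Mean of (rater - consensus) / consensus over images.

    Inputs are per-image positive voxel counts. Images with an empty
    consensus are skipped (assumption, see the text above).
    """
    ratios = [(r - c) / c
              for r, c in zip(rater_counts, consensus_counts) if c > 0]
    return mean(ratios)


# Hypothetical counts: +20% on a small image, +10% on a large one.
example = relative_bias([12, 330], [10, 300])
```

Here both images contribute equally to the result, which is the mean of a 20% and a 10% over-segmentation, regardless of their very different volumes.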
Figures 7 and 8 show that on both datasets, the relationship between uncertainty and bias is still present when using relative bias. Compared to absolute bias, the correlation with relative bias is slightly stronger for MS (0.64 vs 0.60) and identical for GM (0.93). On the qualitative side, for MS lesions we observe in Figure 7 that one rater is an outlier (blue dot close to the orange ones). It was already relatively far from its peers in Figure 2, but this is exacerbated here. On GM segmentation, Figure 8 shows that switching between absolute and relative bias makes almost no difference: the rater distribution is almost identical to Figure 3. Overall, GM bias is much lower than MS bias in both the absolute and relative cases; since the task is easier, there is less disagreement between raters. The lack of difference between relative and absolute bias for GM is potentially explained by the fact that the GM volume varies much less across slices and subjects than MS lesion volume. It is in line with our expectations that relative bias accentuates the errors on very small lesions, which is why this re-weighting mostly affects the MS dataset.
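For readers who want to reproduce this kind of comparison, a correlation coefficient between per-rater bias and per-rater uncertainty can be computed as below. The text does not specify which coefficient was used, so Pearson's r is an assumption here, and the numbers are placeholders rather than results from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only (one entry per rater); not data from the study.
bias_per_rater = [-2.0, -1.0, 0.0, 1.0, 2.0]
uncertainty_per_rater = [1.0, 2.0, 3.0, 4.0, 5.0]
r = pearson(bias_per_rater, uncertainty_per_rater)
```

The placeholder data are perfectly linear, so the coefficient equals 1; running the same function once with absolute and once with relative bias values gives the kind of comparison reported above.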
- Carass et al. (2017). Longitudinal Multiple Sclerosis Lesion Segmentation: Resource & Challenge. NeuroImage 148, pp. 77–102.
- Chotzoglou and Kainz (2019). Exploring the Relationship Between Segmentation Uncertainty, Segmentation Performance and Inter-observer Variability with Probabilistic Networks. In Large-Scale Annotation of Biomedical Data and Expert Label Synthesis and Hardware Aware Learning for Medical Imaging and Computer Assisted Intervention, L. Zhou, N. Heller, Y. Shi, Y. Xiao, R. Sznitman, V. Cheplygina, D. Mateus, E. Trucco, X. S. Hu, D. Chen, M. Chabanas, H. Rivaz, and I. Reinertsen (Eds.), Vol. 11851, pp. 51–60.
- Commowick et al. (2018). Objective Evaluation of Multiple Sclerosis Lesion Segmentation using a Data Management and Processing Infrastructure. Scientific Reports 8 (1), pp. 13650.
- Davies and Bouldin (1979). A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2), pp. 224–227.
- Gros et al. (2019). Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks. NeuroImage 184, pp. 901–915.
- Gros et al. (2020). SoftSeg: Advantages of soft versus binary training for image segmentation. arXiv:2011.09041 [cs, eess].
- Gros et al. (2021). Ivadomed: A Medical Imaging Deep Learning Toolbox. Journal of Open Source Software 6 (58), pp. 2868.
- Jensen et al. (2019). Improving Uncertainty Estimation in Convolutional Neural Networks Using Inter-rater Agreement. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Vol. 11767, pp. 540–548.
- Jungo et al. (2018). On the Effect of Inter-observer Variability for a Reliable Estimation of Uncertainty of Medical Image Segmentation. arXiv:1806.02562 [cs].
- Kendall and Gal (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?. arXiv:1703.04977 [cs].
- Kiureghian and Ditlevsen (2009). Aleatory or epistemic? Does it matter?. Structural Safety 31 (2), pp. 105–112.
- Lemay et al. (2021). Benefits of Linear Conditioning for Segmentation using Metadata. arXiv:2102.09582 [cs, eess].
- Milletari et al. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- Nair et al. (2020). Exploring uncertainty measures in deep networks for Multiple sclerosis lesion detection and segmentation. Medical Image Analysis 59, pp. 101557.
- Paszke et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [cs, stat].
- Perez et al. (2017). FiLM: Visual Reasoning with a General Conditioning Layer. arXiv:1709.07871 [cs, stat].
- Prados et al. (2017). Spinal cord grey matter segmentation challenge. NeuroImage 152, pp. 312–329.
- Ronneberger et al. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs].
- Shwartzman et al. (2019). The Impact of an Inter-rater Bias on Neural Network Training. arXiv:1906.11872 [cs, eess].
- Sudre et al. (2019). Let’s agree to disagree: learning highly debatable multirater labelling. arXiv:1909.01891 [cs].
- Tanno et al. (2019). Learning From Noisy Labels by Regularized Estimation of Annotator Confusion. pp. 11244–11253.
- Vincent et al. (2020). Automatic segmentation of spinal multiple sclerosis lesions: How to generalize across MRI contrasts?. arXiv:2003.04377 [cs, eess].
- Wang et al. (2019). Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 338, pp. 34–45.
- Warfield et al. (2004). Simultaneous Truth and Performance Level Estimation (STAPLE): An Algorithm for the Validation of Image Segmentation. IEEE Transactions on Medical Imaging 23 (7), pp. 903–921.
- Zhang et al. (2020). Disentangling Human Error from the Ground Truth in Segmentation of Medical Images. arXiv:2007.15963 [cs].