Deep neural networks (and more specifically convolutional neural networks) have deeply percolated through healthcare R&D, addressing various problems such as survival prediction, disease diagnostics, image registration, anomaly detection, and segmentation of images, be it Magnetic Resonance Imaging (MRI), Computed Tomography (CT) or ultrasound (US), to name a few Litjens et al. (2017). The roaring success of deep learning methods is rightly attributed to the unprecedented amount of annotated data across domains. But ironically, while solutions to decades-old medical problems are at hand Bernard et al. (2018), the use of neural networks in day-to-day practice is still pending. This can be explained in part by the following two observations. First, while accurate on average, neural networks can nonetheless sometimes be wrong N.Painchaud et al. (2020) as they provide no strict clinical guarantees. In other words, a neural network within the intra-expert variability is excellent on average but not immune to rare yet degenerate results, which is problematic in clinical practice Bernard et al. (2018). Second, machine learning methods are known to suffer from domain adaptation problems, one of the most glaring medical imaging issues of our time Venkataramani et al. (2019). As such, clinically accurate machine learning methods trained on a specific set of data almost always see their performance drop when tested on a dataset acquired following a different protocol. These problems derive in good part from the fact that current datasets are still relatively small. According to Maier-Hein et al. (2018), most medical imaging challenges organized so far contain fewer than 100 training and testing cases. This shows that medical applications cannot yet rely on very large datasets encompassing tens of thousands of annotated images acquired under various conditions, with machines from various vendors, and showing clinical conditions and anatomical configurations of all kinds.
For these reasons, the medical imaging literature has seen an increasing number of publications whose goal is to compensate for the lack of expert annotations Karimi et al. (2020). While some methods leverage partly-annotated datasets Can et al. (2018), others use domain adaptation strategies to compensate for small training datasets Choudhary et al. (2020). Other approaches artificially increase the number of annotated images with Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Skandarani et al. (2020), and yet others use third-party neural networks to help experts annotate images more rapidly Girum et al. (2020).
While these methods have been shown effective for their specific test cases, it is widely accepted that large manually-annotated datasets bring indisputable benefits Sun et al. (2017). In this work, we depart from trying to improve segmentation methods and focus instead on the datasets: we challenge the idea that medical data, cardiac cine MRI specifically, needs to be labeled by experts only, and explore the consequences of using non-expert annotations on the generalization capabilities of a neural network. Non-expert here refers to a non-physician who could not be regarded as a reference in the field. Since non-expert annotations are easier and cheaper to obtain, they could be used to build larger datasets faster and at a reduced cost.
This idea was tested on cardiac cine-MRI segmentation. To this end, we had two non-experts label cardiac cine-MRI images and compared the performance of neural networks trained on non-expert and expert data. The comparison between both approaches was done with geometric metrics (Dice index and Hausdorff distance) as well as clinical parameters, namely the ejection fraction of the left and right ventricles and the myocardial mass.
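For concreteness, two of the evaluation quantities mentioned above, the Dice index and the ejection fraction, can be computed as follows. This is an illustrative NumPy sketch (the helper names `dice_index` and `ejection_fraction` are ours, not the evaluation code used in this study):

```python
import numpy as np

def dice_index(pred, gt):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def ejection_fraction(edv, esv):
    """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv - esv) / edv

# toy 4x4 masks: 4-pixel square vs. 6-pixel rectangle, 4 shared pixels
a = np.zeros((4, 4)); a[1:3, 1:3] = 1
b = np.zeros((4, 4)); b[1:3, 1:4] = 1
```

With these toy masks, the Dice index is 2*4/(4+6) = 0.8; an EDV of 120 mL and ESV of 50 mL give an ejection fraction of about 58.3%.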
Methods and Data
As mentioned before, medical data annotation requires proper expertise so the labeling can be used with full confidence. Expert annotators are typically medical doctors or medical specialists whose training and experience make them reliable sources of truth for the problem at hand. These experts often have close collaborators working daily with medical data, typically computer scientists, technicians, biophysicists, etc. While their understanding of the data is real, these non-experts are typically not considered a reliable source of truth. Non-experts are thus people who can manually label data but whose annotations are biased and/or noisy and therefore deemed unreliable.
In this study, two non-experts were asked to label 1902 cardiac images. We defined a non-expert as someone with no professional expertise in cardiac anatomy nor cine-MR images. Non-Expert 1 is a technician in biotechnology who received a 30-minute training session by a medical expert on how to recognize and outline cardiac structures. The training was done on a few examples where the expert showed what the regions of interest in the image look like and where their boundaries lie. The training also came with an introduction to cardiac anatomy and its temporal dynamics. Non-Expert 2 is a computer scientist with 4 years of active research in cardiac cine-MRI and several months of training. In the case of Non-Expert 2, the training spanned several months, during which the imaging modality as well as the anatomy and pathologies were thoroughly explained. In addition, fine delineation guidelines were provided to disambiguate good from poor annotations. In this study, we both gauge the effect of training a neural network on non-expert data and verify how the level of training of the non-experts impacts the overall results.
The non-experts were asked to delineate three structures of the heart namely, the left ventricular cavity (endocardial border), the left ventricle myocardium (epicardial border) and the endocardial border of the right ventricle. No further quality control was done to validate the non-expert annotations. Segmentations were used as-is for the subsequent tasks.
We used the gold standard for medical image segmentation, U-Net Ronneberger et al. (2015), as the baseline network. In addition, the well-known Attention U-Net Oktay et al. (2018) and ENet Paszke et al. (2016) networks were trained in order to ensure that differences in results stem from the differences in annotations and not from the network architecture. We first trained the segmentation models (U-Net, Attention U-Net and ENet) on the original ACDC (Automated Cardiac Diagnosis Challenge) dataset Bernard et al. (2018) with its associated groundtruth using a classical supervised training scheme, with a combined cross-entropy and Dice loss:
$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \;+\; 1 - \frac{2\sum_{i=1}^{N}\sum_{c=1}^{C} p_{i,c}\, y_{i,c}}{\sum_{i=1}^{N}\sum_{c=1}^{C} \left(p_{i,c} + y_{i,c}\right)}$
where $p_{i,c}$ is the probabilistic output of the network for image $i$ ($N$ is the number of images in the batch) and class $c$ ($C$ is the number of classes), and
$y_{i,c}$ is a one-hot encoding of the ground truth segmentation map.
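The combined cross-entropy and Dice loss can be sketched in NumPy as follows. This is an illustrative implementation for a batch of N samples and C classes (the helper name `ce_dice_loss` is ours; the authors' actual code may differ in details such as class weighting or smoothing):

```python
import numpy as np

def ce_dice_loss(p, y, eps=1e-7):
    """Combined cross-entropy + soft-Dice loss.

    p : (N, C) softmax probabilities output by the network.
    y : (N, C) one-hot ground truth.
    """
    # cross-entropy term, averaged over the batch
    ce = -np.mean(np.sum(y * np.log(p + eps), axis=1))
    # soft Dice term, averaged over classes
    inter = np.sum(p * y, axis=0)
    denom = np.sum(p + y, axis=0)
    dice = np.mean((2.0 * inter + eps) / (denom + eps))
    return ce + (1.0 - dice)
```

A near-perfect prediction drives both terms close to zero, while a confidently wrong prediction is penalized heavily by the cross-entropy term.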
We then re-trained the neural networks with the non-expert labels using the same training configuration. Furthermore, considering that the non-expert annotations can be seen as noisy versions of the true annotation (i.e. $\tilde{y} = y + \eta$, where $\tilde{y}$ is the non-expert annotation, $y$ is the groundtruth and $\eta$ is a random variable), we also trained the networks with a mean absolute error loss which, as shown by Ghosh et al. (2017), has the desirable property of compensating for labeling inaccuracies.
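To see why a mean absolute error loss is less sensitive to label noise, note that for softmax outputs the per-sample MAE penalty is bounded by 2, whereas the cross-entropy penalty on a mislabeled sample is unbounded. A minimal sketch illustrating this (the helper names are ours):

```python
import numpy as np

def mae_loss(p, y):
    """Mean absolute error between softmax outputs and one-hot labels.

    Each sample contributes at most 2 to the inner sum, so a handful of
    mislabeled samples cannot dominate the loss the way they can with
    cross-entropy, whose per-sample penalty is unbounded.
    """
    return np.mean(np.sum(np.abs(p - y), axis=1))

def ce_loss(p, y, eps=1e-12):
    """Standard cross-entropy, for comparison."""
    return -np.mean(np.sum(y * np.log(p + eps), axis=1))
```

On a confidently mislabeled sample, cross-entropy blows up while MAE stays below its bound of 2, which is the intuition behind the robustness result of Ghosh et al. (2017).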
To test whether non-expert annotated datasets hold any value for cardiac MRI segmentation, the following two cardiac cine MRI datasets were used:
Automated Cardiac Diagnosis Challenge (ACDC) dataset Bernard et al. (2018). This dataset comprises 150 exams acquired at the University Hospital of Dijon (all from different patients). It is divided into 5 evenly distributed subgroups (4 pathological groups plus 1 group of healthy subjects) and split into 100 exams for training and 50 held-out exams for testing. The exams were acquired using two MRI scanners with different magnetic field strengths (1.5T and 3T). The pixel spacing varies from to with a slice spacing varying between to . An example image with the different expert and non-expert annotations is shown in Figure 1.
Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation (M&Ms) dataset. This dataset consists of 375 cases from 3 different countries (Spain, Germany and Canada), totaling 6 different centres and 4 different MRI manufacturers (Siemens, General Electric, Philips and Canon). The cohort is composed of patients with hypertrophic and dilated cardiomyopathies as well as healthy subjects. The cine MR images were annotated by experienced clinicians from the respective centres.
We trained the segmentation models on the 100 ACDC training subjects with either the expert or the non-expert groundtruth data. Training was done with a fixed set of hyperparameters, chosen through a cross-validated hyperparameter search to best fit the 3 annotators, without further tuning. The networks were trained three times each in order to reduce the effect of the stochastic nature of the training process on the results.
As mentioned before, we first trained the neural networks on non-expert data with exactly the same setup as for the expert annotations. Then, we retrained the neural networks (U-Net, Attention U-Net and ENet) from scratch with an L1 loss, which was shown to be robust to noisy labels Ghosh et al. (2017).
We then tested in turn on the 50 ACDC test subjects and the 150 M&Ms training cases. The M&Ms dataset constitutes data whose groundtruth is not biased towards any of the annotators of the training set, be it the expert or the non-experts. Moreover, testing on a different dataset provides an inter-expert variability range as well as a domain generalization setup.
Results and Discussion
The first set of results is laid out in Tables 1 to 3. They report standard geometrical metrics, i.e. the Dice score and the Hausdorff distance (HD), for the left ventricular (LV) cavity (Table 1), the myocardium (MYO) (considering the endocardial and epicardial borders of the left ventricle) (Table 2) and the right ventricular (RV) cavity (Table 3). They also contain the end-diastolic volume (EDV) as well as the ejection fraction (EF) for the LV and the RV, and the myocardial mass error. For all three tables, the networks (U-Net, Attention U-Net and ENet) were trained on the ACDC training set and tested on the ACDC testing set and the M&Ms training set.
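As an aside, the Hausdorff distance reported in these tables measures the worst-case boundary disagreement between two segmentations. A brute-force NumPy sketch over contour point sets (the helper name `hausdorff_distance` is ours, not the evaluation code used here):

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two point sets.

    a, b : (k, 2) and (m, 2) arrays of contour pixel coordinates.
    Returns the largest distance from a point in one set to its
    nearest neighbour in the other set.
    """
    # pairwise distance matrix of shape (k, m)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Unlike the Dice score, this metric is dominated by the single worst-matched contour point, which is why it is sensitive to the local outlining errors discussed below.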
Results for the ACDC testing set reveal that for the LV (Table 1), the networks trained on the non-expert annotations (Non-Expert 1 as well as Non-Expert 2) achieve performances that are statistically indistinguishable from those obtained with the expert annotations. This holds regardless of the training loss (CE+Dice vs MAE+Dice) and the metric (Dice, HD, and EF). The only exception is the EF error for Non-Expert 1 with the MAE+Dice loss.
The situation is however fuzzier for the MYO and the RV. In both cases, results for Non-Expert 1 are almost always worse than those of the expert, especially with the CE+Dice loss; for example, the Dice score drops noticeably on the myocardium. The clinical results on the RV (Table 3) also show a clear gap between Non-Expert 1 and the other annotators. However, the MAE+Dice loss improves results for both non-experts. Overall for the MYO and the RV, results for Non-Expert 2 are very close to (if not better than) those of the expert. This is obvious when considering the average myocardial mass error in Table 2. One recurrent result from our experiments is the hit-and-miss performance of all the evaluated networks on the M&Ms dataset, where in a number of cases the output segmentation was completely degenerate, as shown in Figure 2. Moreover, the difference in segmentation performance between the annotators is similar regardless of the segmentation network (U-Net, Attention U-Net or ENet) used, although Attention U-Net shows the best performance overall, which is to be expected given its larger capacity.
Further analysis of the segmentation performance on the different sections of the heart (Figure 3), namely the base, the middle and the apex, shows that the differences between the non-expert and expert annotations are concentrated at the two ends of the heart. The performance gap is most pronounced at the apex for all three anatomical structures. In parallel, when looking at performance across disease groups (Figure 4), the Dice scores of the different annotators are relatively similar across groups.
Our experiments also reveal some interesting results on the M&Ms dataset, a dataset with different acquisition settings than ACDC. In that case, the gap in performance between the expert and the non-experts decreases substantially. For example, when comparing the Non-Expert 1 results with the MAE+Dice loss to those obtained with the expert annotations, the Dice difference for the RV goes from 6% on the ACDC dataset to a mere 4% on the M&Ms dataset. But overall, there again, results for Non-Expert 2 are similar to (and sometimes better than) those of the expert.
Throughout our experiments, the performances of the three neural networks (U-Net, Attention U-Net and ENet) trained on the Non-Expert 2 annotations with the MAE+Dice loss have been roughly on par with, if not better than, those trained on the expert annotations. This is especially true for the LV. For Non-Expert 1, most likely due to a lack of proper training, results on both test sets and most MYO and RV metrics are worse than those of the expert. In fact, a statistical test reveals that results from Non-Expert 1 are almost always statistically different from those of the expert. We also evaluated the statistical difference between the CE+Dice and the MAE+Dice losses and observed that the MAE+Dice loss provides overall better results for both non-experts.
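The text does not name the statistical test used. As one possible instantiation (an assumption on our part, not the authors' procedure), per-case metric differences between two annotators can be compared with a paired sign-flip permutation test:

```python
import numpy as np

def paired_permutation_test(x, y, n_perm=10000, seed=0):
    """Two-sided paired permutation test on per-case metric differences.

    Under the null hypothesis that the two annotators are exchangeable,
    the sign of each paired difference is arbitrary; we randomly flip
    signs and compare the permuted mean differences with the observed one.
    Returns an approximate p-value.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    obs = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm = np.abs((signs * d).mean(axis=1))
    # +1 correction keeps the p-value strictly positive
    return (np.sum(perm >= obs) + 1) / (n_perm + 1)
```

A paired Wilcoxon signed-rank test would be a common alternative for per-case Dice or HD comparisons; the permutation test above makes fewer distributional assumptions.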
Overall on M&Ms, while the expert got a better MYO mass error and a better RV EF error, the MYO HD is lower for Non-Expert 2, and the Dice score and the RV HD of Non-Expert 2 are statistically similar. These results support the idea that well-trained non-expert and expert annotations could be used interchangeably to build reliable annotated cardiac datasets. Admittedly, the small number of non-experts we evaluated might be considered a limitation of our study; nevertheless, our results are encouraging for settings where experts are not readily available to annotate whole datasets but could train a non-expert to annotate in their stead. We leave to future work the investigation of more datasets, transposing this setup to more difficult problems and a larger number of non-experts. Our work supplements previous endeavours that rely on non-experts to annotate medical datasets: Heim et al. (2018) showcased the ability of crowdsourced expertise to reliably annotate a liver dataset, although their approach presents initial segmentations to the non-experts, which might bias their decisions. Likewise, Ganz et al. (2016) proposed to use non-experts in a crowdsourced error detection framework. In contrast, our approach evaluated the effectiveness of non-expert knowledge without any prior input. This further reinforces the idea that crowdsourced medical annotations are a viable solution to the lack of data.
In this work, we studied the usefulness of training deep learning models with non-expert annotations for the segmentation of cardiac MR images. The need for medical experts was probed in a comparative study with non-physician-sourced labels. By framing the reliance on non-expert annotations as a noisy-label problem, we obtained good performance on two public datasets, one of which was used to emulate an out-of-distribution dataset. We found that training a deep neural network, regardless of its capacity (U-Net, Attention U-Net or ENet), on data labeled by a well-trained non-expert achieves performance comparable to training on expert data. Moreover, the performance gap between networks trained on non-expert and expert annotations was less pronounced on the out-of-distribution dataset than on the training dataset. Future endeavors could focus on crowdsourcing large-scale medical datasets and tailoring approaches that take their noisiness into account.
Y.S: writing manuscript, developing software, experiment design, performing experiment.
P-M.J : initial idea, writing manuscript, experiment design, data analysis.
A.L: initial idea, writing manuscript, resource management, data analysis.
This research received no external funding.
Acknowledgements. We would like to acknowledge Ivan Porcherot for the tremendous work he did annotating the datasets. The authors declare no conflict of interest.
References
- Litjens et al. (2017) Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Medical Image Analysis 2017, 42, 60–88.
- Bernard et al. (2018) Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; others. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE transactions on medical imaging 2018, 37, 2514–2525.
- N.Painchaud et al. (2020) Painchaud, N.; Skandarani, Y.; Judge, T.; Bernard, O.; Jodoin, P.M. Cardiac Segmentation with Strong Anatomical Guarantees. IEEE Transactions on Medical Imaging 2020, 39.
- Venkataramani et al. (2019) Venkataramani, R.; Ravishankar, H.; Anamandra, S. Towards Continuous Domain Adaptation For Medical Imaging. 2019 IEEE 16th International Symposium on Biomedical Imaging, 2019, pp. 443–446.
- Maier-Hein et al. (2018) Maier-Hein, L.; Eisenmann, M.; Reinke, A.; Onogur, S.; Stankovic, M.; Scholz, P.; Arbel, T.; Bogunović, H.; Bradley, A.; Carass, A.; Feldmann, C.; Frangi, A.; Full, P.M.; van Ginneken, B.; Hanbury, A.; Honauer, K.; Kozubek, M.; Landman, B.; März, K.; Maier, O.; Maier-Hein, K.; Menze, B.; Müller, H.; Neher, P.; Niessen, W.; Rajpoot, N.; Sharp, G.; Sirinukunwattana, K.; Speidel, S.; Stock, C.; Stoyanov, D.; Taha, A.A.; van der Sommen, F.; Wang, C.W.; Weber, M.; Zheng, G.; Jannin, P.; Kopp-Schneider, A. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 2018, 9.
- dat (2021) Dataset list - A list of the biggest machine learning datasets, 2021.
- mtu (2021) Amazon Mechanical Turk, 2021.
- Karimi et al. (2020) Karimi, D.; Dou, H.; Warfield, S.K.; Gholipour, A. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical Image Analysis 2020, 65, 101759.
- Can et al. (2018) Can, Y.B.; Chaitanya, K.; Mustafa, B.; Koch, L.M.; Konukoglu, E.; Baumgartner, C.F. Learning to Segment Medical Images with Scribble-Supervision Alone. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer International Publishing, 2018, pp. 236–244.
- Choudhary et al. (2020) Choudhary, A.; Tong, L.; Zhu, Y.; Wang, M. Advancing Medical Imaging Informatics by Deep Learning-Based Domain Adaptation. Yearbook of Medical Informatics 2020, 29, 129 – 138.
- Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 2014, pp. 2672–2680.
- Skandarani et al. (2020) Skandarani, Y.; Painchaud, N.; Jodoin, P.M.; Lalande, A. On the effectiveness of GAN generated cardiac MRIs for segmentation. Medical Imaging with Deep Learning 2020.
- Girum et al. (2020) Girum, K.B.; Créhange, G.; Hussain, R.; Lalande, A. Fast interactive medical image segmentation with weakly supervised deep learning method. International Journal of Computer Assisted Radiology and Surgery 2020, 15, 1437–1444.
- Sun et al. (2017) Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 843–852.
- Ronneberger et al. (2015) Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234–241.
- Ghosh et al. (2017) Ghosh, A.; Kumar, H.; Sastry, P. Robust loss functions under label noise for deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, Vol. 31.
- (17) Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge.
- Heim et al. (2018) Heim, E.; Roß, T.; Seitel, A.; März, K.; Stieltjes, B.; Eisenmann, M.; Lebert, J.; Metzger, J.; Sommer, G. Large-scale medical image annotation with crowd-powered algorithms. Journal of Medical Imaging 2018, 5, 1.
- Ganz et al. (2016) Ganz, M.; Kondermann, D.; Andrulis, J.; Knudsen, G. M.; Maier-Hein, L. Crowdsourcing for error detection in cortical surface delineations. International Journal of Computer Assisted Radiology and Surgery 2016, 12, 161–166.
- Oktay et al. (2018) Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.J.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.; Kainz, B.; Glocker, B.; Rueckert, D. Attention U-Net: Learning Where to Look for the Pancreas. Medical Imaging with Deep Learning 2018.
- Paszke et al. (2016) Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint, 2016, arXiv:1606.02147.