Semi-automated labelling of medical images: benefits of a collaborative work in the evaluation of prostate cancer in MRI

08/29/2017 ∙ by Christian Mata, et al. ∙ Université de Bourgogne

Purpose: The goal of this study is to show the advantage of collaborative work in the annotation and evaluation of prostate cancer tissues from T2-weighted MRI compared with the commonly used double-blind evaluation. Methods: The variability of medical findings focused on the prostate gland (central gland, peripheral and tumoural zones) by two independent experts was first evaluated, and then compared with a consensus of these two experts. Using a prostate MRI database, experts drew regions of interest (ROIs) corresponding to healthy prostate (peripheral and central zones) and cancer using a semi-automated tool. One of the experts then drew the ROIs with knowledge of the other expert's ROIs. Results: The surface area of each ROI, as well as the Hausdorff distance and the Dice coefficient for each contour, were evaluated between the different experiments, taking the drawing of the second expert as the reference. The results showed that the significant differences between the two experts became non-significant with collaborative work. Conclusions: This study shows that collaborative work with a dedicated tool allows a better consensus between experts than a double-blind evaluation. Although we show this for prostate cancer evaluation in T2-weighted MRI, the results of this research can be extrapolated to other diseases and kinds of medical images.




1 Introduction

The annotation of medical images is subject to an inherent variability between experts (inter-expert variability), and in some cases there are also significant differences between annotations made by the same expert (intra-expert variability). This difficulty in annotating medical findings has several causes, including the images themselves (difficult to interpret, low resolution and/or subtle changes, …), the expert performing the annotations (experience, tiredness, …) and the working conditions (monitor, annotating device, illuminance, …). It is commonly accepted that one way to reduce this variability is to overlap the annotations of different experts, each of whom worked blindly with respect to the others. In this paper we show that a collaborative approach can reduce the variability between experts even further.

Figure 1: Example of a PCa processing from E1 (left image) and from E2 (right image). Notice that the contours are similar for E1 and E2.

We centred our study on the analysis of prostate cancer. Prostate cancer (PCa) remains one of the most commonly diagnosed solid tumours among men. In the United States, there were an estimated 180,890 new cases and 26,120 deaths in 2016 (Siegel et al., 2014). Simply speaking, the prostate is composed of the peripheral zone (PZ), the transitional zone (TZ) and the central zone (CZ). Most cancer lesions occur in the peripheral zone of the gland. A detailed description of the influence of the prevalence factor risk according to the prostate zone is given in De Marzo et al. (2007). Among the techniques used to detect PCa, Magnetic Resonance Imaging (MRI) allows the non-invasive analysis of the anatomy and the metabolism of the entire prostate gland. MRI has been established as the best imaging modality for the detection, localisation and staging of PCa on account of its high resolution, its excellent spontaneous contrast of soft tissues and the possibilities of multi-planar and multi-parameter scanning (Chen et al., 2008).

The annotations of the prostate were performed by each expert using the ProstateAnalyzer software (Mata et al., 2015). ProstateAnalyzer allows manual drawing of the different regions of interest (ROIs) of the prostate, such as CZ, TZ, PZ and tumour lesion, thanks to the combination of MRI techniques (such as oblique axial T2-weighted, diffusion and perfusion imaging) and magnetic resonance spectroscopy. The manual drawing of these regions is of crucial importance for the disease prognosis. However, accurate manual annotation requires combining several MRI techniques, which makes the task challenging because of the high volume of information present in these images.

The purpose of the present study is to evaluate the variability between experts concerning medical findings in prostate gland regions. We study the differences observed in the delimitation of the different ROIs when the users work independently and when they work collaboratively. The idea behind this study is to show that collaborative work allows a real consensus between experts and potentially decreases the variability in their evaluation. Figure 1(a) presents an example of the prostate gland analysis with manual drawing of the TZ (in white), PZ (in blue) and tumour area (in red). We asked two experts to make these drawings independently on several MR examinations and, as a second step, one expert repeated the drawings with knowledge of the evaluation of the other expert. Differences in the drawings, as well as in the volume calculations, were compared in order to verify that there is a significant increase in consensus with collaborative work.

2 Materials and Methods

2.1 Database

A database with prostate MRI based on clinical data with tumour and healthy cases was used. The examinations used in our study contained three-dimensional T2-weighted fast spin-echo (TR / TE / ETL: 3600 ms / 143 ms / 109, slice thickness: mm) images acquired with sub-millimetric pixel resolution in an oblique axial plane. Each study comprises a set of images.

The annotations were performed using the ProstateAnalyzer tool, which allows an annotation to be drawn in a given MRI modality and automatically draws the same annotation in the other modalities. We mainly annotated the regions using the T2-weighted image (T2WI). The T2WI modality was chosen because it provides the best depiction of the prostate's zonal anatomy. However, in cases where the tumoural region was better depicted in the other modalities, these were used to help the annotation. The two experts have more than 10 years of experience working with prostate imaging.

2.2 Evaluation procedure

The evaluation procedure was performed according to different experiments to analyse the three main ROIs (TZ, PZ and tumour lesion (Tum)). Experts have drawn prostate zones corresponding to PZ, TZ and Tum.

The first experiment (E1) consisted of a prostate study evaluation provided by the first expert. It consisted of drawing ROIs of the prostate gland zones where required. For each ROI the surface area was calculated. Considering all the images for one patient, the volume of each tissue was calculated as the surface of each area multiplied by the slice thickness. Similarly, a second experiment (E2) was carried out independently by a second expert in the same manner as E1. Finally, the first expert repeated the same processing with knowledge of the drawing and the evaluation performed by the second expert (experiment E3). The ProstateAnalyzer drawing tool offers the possibility of showing all the ROIs drawn by every expert. A minimum delay between E1 and E3 should be imposed to prevent the expert from remembering his previous tracing. This delay was greater than one month in our study. E2 was considered as the reference experiment. This means that the comparison procedure to evaluate the influence of collaborative work was done in two steps: first, E1 vs. E2 and E3 vs. E2 were each evaluated, and second, these two evaluations were compared.
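The surface and volume computation described above can be sketched as follows. This is only an illustration under stated assumptions, not the ProstateAnalyzer implementation: each ROI is assumed to be a simple closed polygon whose vertices are given in millimetres, its area is obtained with the shoelace formula, and the per-slice areas are summed before multiplying by the slice thickness. The function names and the 3 mm thickness are hypothetical.

```python
def polygon_area(points):
    """Area of a simple closed polygon via the shoelace formula.

    points: list of (x, y) vertices in mm, in drawing order.
    """
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def roi_volume(contours_per_slice, slice_thickness_mm):
    """Sum of the per-slice ROI areas multiplied by the slice thickness."""
    total_area = sum(polygon_area(c) for c in contours_per_slice)
    return total_area * slice_thickness_mm


# Hypothetical example: the same 10 mm x 10 mm square drawn on three slices,
# with an assumed slice thickness of 3 mm (not the value used in the study).
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
volume = roi_volume([square, square, square], slice_thickness_mm=3.0)
print(volume)  # 900.0 (mm^3)
```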

Figure 2: Example of a prostate study evaluation from (a) E1 and (b) E2 with a discordance between both drawings for the tumour area. (c) New evaluation of the prostate study from E3 with a good agreement for the tumour area between E2 and E3.

2.3 Evaluation parameters

The correlation coefficient, the regression analysis and the Bland-Altman plot (Altman and Bland, 1983; Bland and Altman, 1986) were used to compare the surfaces obtained with E2 with those obtained with E1 and E3, respectively. Moreover, a linear correlation estimation between E1 and E2, as well as between E3 and E2, was performed using a two-sample t-test (Rice, 2006). A p-value below the chosen significance threshold was considered to indicate a statistically significant difference. In addition, the contours obtained from experiment E2 were compared with those obtained with E1 and also with those obtained with E3. An edge-based approach using the Hausdorff distance (Rote, 1991) and a region-based approach with the Dice index (Guang-Zhong and Tianzi, 2004) were considered. The mean and the standard deviation of each parameter for the whole data set were calculated. Again, a two-sample t-test was used to verify whether there was any significant difference between the calculation of these parameters taking E1 and E2, and taking E3 and E2. Before these analyses, and for each type of tissue, the number of cases in which one expert chose to annotate an image and the other expert did not (i.e. corresponding to the upper and lower slices) was counted and presented as a percentage of the total number of slices processed by the second expert.
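As an illustration of the two contour measures named above, the following sketch computes a Dice index on sets of pixel coordinates and a symmetric Hausdorff distance on sets of contour points. It uses the standard textbook definitions with a brute-force search; the function names, the toy data and the convention for two empty regions are assumptions made for this example, not details taken from the study.

```python
import math


def dice_index(region_a, region_b):
    """Dice index 2|A n B| / (|A| + |B|) for two sets of pixel coordinates."""
    if not region_a and not region_b:
        return 1.0  # convention chosen here: two empty regions agree perfectly
    overlap = len(region_a & region_b)
    return 2.0 * overlap / (len(region_a) + len(region_b))


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point of points_a to its nearest point of points_b."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


def square_pixels(x0, y0, side):
    """Hypothetical helper: pixel coordinates of a filled square region."""
    return {(x, y) for x in range(x0, x0 + side) for y in range(y0, y0 + side)}


# Toy data: a 4 x 4 region standing in for the reference drawing,
# and the same region shifted one pixel to the right.
reference_region = square_pixels(0, 0, 4)
shifted_region = square_pixels(1, 0, 4)
dice = dice_index(shifted_region, reference_region)
print(dice)  # 0.75

# Contours reduced to their corner points for brevity.
reference_contour = [(0, 0), (4, 0), (4, 4), (0, 4)]
shifted_contour = [(1, 0), (5, 0), (5, 4), (1, 4)]
hd = hausdorff_distance(shifted_contour, reference_contour)
print(hd)  # 1.0
```

The brute-force search is quadratic in the number of contour points, which is acceptable for hand-drawn ROIs but would be replaced by a spatial index for dense contours.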

| Patient | Processed slices | CZ: E1 vs. E2 | CZ: E3 vs. E2 | PZ: E1 vs. E2 | PZ: E3 vs. E2 | Tumour: E1 vs. E2 | Tumour: E3 vs. E2 |
|---|---|---|---|---|---|---|---|
| Patient 1 | 18 | 6% | 6% | 11% | 0% | 0% | 0% |
| Patient 2 | 21 | 10% | 10% | 10% | 10% | 0% | 0% |
| Patient 3 | 25 | 8% | 0% | 8% | 0% | 12% | 4% |
| Patient 4 | 30 | 7% | 0% | 7% | 0% | 17% | 0% |
| Patient 5 | 24 | 13% | 0% | 13% | 8% | 13% | 0% |
| Patient 6 | 17 | 18% | 0% | 12% | 0% | 65% | 0% |
| Patient 7 | 26 | 12% | 4% | 8% | 0% | 19% | 0% |
| Patient 8 | 31 | 19% | 10% | 3% | 6% | 0% | 0% |
| Patient 9 | 21 | 14% | 0% | 14% | 0% | 5% | 0% |
| Patient 10 | 25 | 16% | 0% | 8% | 0% | 8% | 0% |

Table 1: Counting of the total number of cases for each area (CZ, PZ and Tumour) that have not been evaluated by the two experts between E1 and E2, and E2 and E3.

| Area | r: E1 vs. E2 | r: E3 vs. E2 | Regression line: E1 vs. E2 | Regression line: E3 vs. E2 | Bland-Altman (mean difference ± SD): E1 vs. E2 | Bland-Altman (mean difference ± SD): E3 vs. E2 | t-test (p-value): E1 vs. E2 | t-test (p-value): E3 vs. E2 |
|---|---|---|---|---|---|---|---|---|
| CZ | 0.95 | 0.98 | y = 0.9x - 166 | y = x - 12 | -261.13 ± 168.20 | -13.07 ± 118.09 | 0.01 | 0.36 |
| PZ | 0.91 | 0.94 | y = 0.9x - 96 | y = 0.9x + 21 | -156.50 ± 95.71 | -10.73 ± 84.60 | 0.01 | 0.32 |
| TUM | 0.96 | 0.98 | y = 0.7x - 3 | y = x + 3 | -54.93 ± 64.34 | -0.08 ± 27.13 | 0.02 | 0.47 |

Table 2: Analysis of CZ, PZ and the tumour (TUM) area calculated (in ) found in the prostate gland using the surface as the anatomical parameter. E2 is the reference and is compared with E1 and E3.

3 Results

Two examples of a PCa analysis are presented in Figures 1 and 2. The left image in Figure 1 corresponds to the drawing by the first expert (E1) and the right image to the drawing by the second expert (E2). Three ROIs are drawn in both images, corresponding to CZ (white), PZ (blue) and tumour (red). When visually comparing the two drawings, a very good concordance between the CZ and PZ areas can be observed. Concerning the tumoural area, a small deviation is found, but the contours can be considered as relatively close between the two experiments.

However, not all the prostate studies were evaluated with such a good concordance between experiments. An example of discordance is shown in Figure 2. CZ and PZ have a good correspondence between E1 and E2, but an important discordance is observed for the tumour area. A new evaluation of the prostate study for E3 concerning the tumour area is then shown in Figure 2(c). In this example, we see the real advantage of collaborative work. After adjustments, the tumour area is similar according to the two experts.

3.1 Anatomic parameters

Table 1 describes, for each tissue, the number of cases in which one expert drew an area and the other did not. It is expressed as a percentage of the total number of slices processed by the second expert. Considering all the data sets (that is, all the patients), a percentage of the total cases is calculated for both comparisons. For CZ, this percentage is 12% for E1 vs. E2 and 3% for E3 vs. E2. In the case of PZ, it is 9% for E1 vs. E2 and 3% for E3 vs. E2. Finally, for the tumour it is 13% for E1 vs. E2 and 0% for E3 vs. E2. It can be seen that in the second comparison (E3 vs. E2) the number of discordances for each area is drastically reduced.
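The percentage reported in Table 1 can be illustrated with a short sketch. It assumes that, for one tissue and one patient, each expert's annotations are summarised as the set of slice indices on which an ROI was drawn; the function name and the slice numbers are hypothetical.

```python
def discordance_percentage(drawn_by_expert, drawn_by_reference, n_reference_slices):
    """Percentage of slices on which exactly one of the two experts drew the ROI.

    drawn_by_expert, drawn_by_reference: sets of slice indices carrying an ROI.
    n_reference_slices: number of slices processed by the reference expert.
    """
    only_one = drawn_by_expert ^ drawn_by_reference  # symmetric difference
    return 100.0 * len(only_one) / n_reference_slices


# Hypothetical patient with 20 processed slices: the two experts disagree
# only on the lowest and the uppermost annotated slice.
first_expert = set(range(2, 18))   # slices 2..17
second_expert = set(range(3, 19))  # slices 3..18
pct = discordance_percentage(first_expert, second_expert, n_reference_slices=20)
print(pct)  # 10.0
```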

The correlation coefficient r, regression line, Bland-Altman analysis and two-sample t-test calculated for the three zones are presented in Table 2. In general, the results are better for E3 vs. E2 than for E1 vs. E2. The Bland-Altman analysis shows better agreement for E3 vs. E2 than for E1 vs. E2, whatever the area. A Bland-Altman analysis was also carried out for the volume evaluation of the CZ, the PZ and the tumour, for both E1 vs. E2 and E3 vs. E2. According to the two-sample t-test, there is no significant difference for E3 vs. E2 whatever the considered area, while there are always significant differences in the results for E1 vs. E2.
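For readers who want to reproduce the first two columns of Table 2 on their own data, a minimal sketch of the Pearson correlation coefficient and the least-squares regression line follows. It assumes the reference surfaces (E2) are taken as x and the compared surfaces as y, which is an assumption about the axes rather than something stated in the text; the data are invented.

```python
import math


def pearson_and_regression(x, y):
    """Return (r, slope, intercept) for paired measurements x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)      # Pearson correlation coefficient
    slope = sxy / sxx                   # least-squares slope of y on x
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Invented surfaces: the compared expert systematically draws smaller ROIs.
reference_surfaces = [100.0, 200.0, 300.0, 400.0]
compared_surfaces = [60.0, 110.0, 160.0, 210.0]  # exactly 0.5 * x + 10
r, slope, intercept = pearson_and_regression(reference_surfaces, compared_surfaces)
print(round(r, 6), round(slope, 6), round(intercept, 6))
```

A high r with a slope well below 1 is possible, as in this toy example, which is why the regression line and the Bland-Altman analysis are read together with the correlation coefficient.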

Figure 3: Regression analysis obtained between (a) E1 vs. E2 and (b) E2 vs. E3 and the corresponding Bland-Altman plots obtained between (c) E1 vs. E2 and (d) E2 vs. E3 concerning the calculation of the surface of a tumour.

Figure 3(a) and (b) illustrate the linear regression analysis concerning the evaluation of the tumour area. The tumour area has been chosen due to its importance and because this area is more difficult to analyse and shows more variation among experts. When comparing the two regression lines, an improvement is noted in Figure 3(b), whose slope is closer to 1 than that of Figure 3(a) (see the regression lines for TUM in Table 2). In the corresponding Bland-Altman plots, Figure 3(d) shows that the mean of the difference for E3 vs. E2 is close to zero, meaning that there is little bias between the two measurements. Moreover, there is a decrease of the standard deviation.
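The two quantities read from a Bland-Altman plot, the mean difference (bias) and the standard deviation of the differences, can be sketched as below. The 95% limits of agreement use the conventional bias ± 1.96 SD; the function name and the sample surfaces are invented for illustration and do not come from the study.

```python
import statistics


def bland_altman(measured, reference):
    """Return (bias, sd, limits) for paired measurements.

    bias: mean of the differences measured - reference.
    sd: sample standard deviation of those differences.
    limits: conventional 95% limits of agreement, bias +/- 1.96 * sd.
    """
    diffs = [m - r for m, r in zip(measured, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


# Invented surfaces for four slices: the compared drawing is slightly smaller.
reference_surfaces = [100.0, 200.0, 300.0, 400.0]
measured_surfaces = [90.0, 195.0, 290.0, 385.0]
bias, sd, limits = bland_altman(measured_surfaces, reference_surfaces)
print(bias)  # -10.0
```

A bias close to zero together with a small standard deviation corresponds to the situation described for the collaborative experiment.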

3.2 Contour evaluation

The Hausdorff distance and the Dice index between the different drawings are presented in Table 3. Again, for E3 vs. E2, an improvement is observed with respect to the results obtained for E1 vs. E2. The mean Hausdorff distance is reduced in all cases. In the same way, the Dice index is around 0.9 for E3 vs. E2 whatever the area, whereas this is not the case for E1 vs. E2. The differences between E1 vs. E2 and E3 vs. E2 are all significant, whatever the considered parameter.

4 Discussion

In this paper we analyse how a ground truth is obtained for prostate cancer analysis in MRI by two different experts. In contrast to what is currently done to obtain a ground truth, where the evaluations of the different experts are made independently and afterwards objectively (or subjectively) merged, we study the interest of collaborative work, where the ground truth is obtained by the two experts, but with the second one knowing the opinion of the first one. Exhaustive evaluations of the medical findings in different regions of the prostate gland from T2WI have been performed for several experiments using our dedicated tool. ROIs were drawn in the prostate gland (TZ, PZ and tumour area if present) on images from a prostate MRI database. We showed that with the collaborative approach the variability between the experts is significantly reduced.

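| Area | Hausdorff distance: E1 vs. E2 | Hausdorff distance: E3 vs. E2 | Dice index: E1 vs. E2 | Dice index: E3 vs. E2 |
|---|---|---|---|---|
| CZ | 8 ± 3 | 4 ± 1 | 0.7 ± 0.2 | 0.9 ± 0.1 |
| PZ | 11 ± 5 | 5 ± 2 | 0.6 ± 0.2 | 0.9 ± 0.1 |
| TUM | 10 ± 4 | 8 ± 11 | 0.7 ± 0.1 | 0.9 ± 0.1 |

Table 3: Analyses of Hausdorff distance (in ) and Dice index (mean ± standard deviation) for the CZ, PZ and tumour area (TUM). The difference between E1 vs. E2 and E3 vs. E2 is significant in all the cases.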

Even if the results are expected, this study shows that evaluating medical examinations with knowledge of the other expert's evaluation drastically reduces the differences between processings. In particular, significant differences between the two experts become non-significant when the work is collaborative. We probably cannot conclude that the diagnosis was improved, but we can agree that there is a consensus between experts. In general, this involves an increase in the quality of the diagnosis. An alternative point of view would be to argue that experiment E3 is biased. An additional experiment could be performed: the second expert could repeat the process knowing what the first one did. However, in our opinion, this is not necessary because the main objective of this study is to show objectively that collaborative work in current clinical practice can provide a real consensus between experts, even if there is potentially a bias in the evaluation process.

In a previous work we also analysed the state of the art in automatic prostate segmentation (Ghose et al., 2012). In that work we identified a set of open problems mainly related to the evaluation procedure. These can be summarised as (1) variability in the ground truth, (2) unavailability of public prostate datasets, and (3) lack of standardised metrics for evaluation. We believe the work presented in this paper lays the foundations for designing a proper dataset for prostate evaluation. The collaborative work explained in this study is the first step towards obtaining a reliable ground truth, without expert variability, against which automatic algorithms could be robustly compared.

In conclusion, although collaborative work requires more time, it can improve the management of patients with prostate cancer by providing a consensual diagnosis, in particular in complex cases.


This work was partially funded by the Spanish R+D+I grant n. TIN2012-37171-C02-01, by UdG grant MPCUdG2016/022. The Regional Council of Burgundy under the PARI 1 scheme also sponsored this work. C. Mata held a Mediterranean Office for Youth mobility grant.


  • Siegel et al. (2014) Siegel, R., Ma, J., Zhaohui, Z., Jemal, A.. Cancer statistics, 2014. CA: Cancer J Clin 2014;64(9–29).
  • De Marzo et al. (2007) De Marzo, A., Platz, E., Sutcliffe, S., Xu, J., Grönberg, H., Drake, C., Nakai, Y., et al. Inflammation in prostate carcinogenesis. Nat Rev Cancer 2007;7:256–269.
  • Chen et al. (2008) Chen, M., Dang, H., Wang, J., Zhou, C., Li, S., Wang, W., et al. Prostate cancer detection: comparison of t2-weighted imaging, diffusion-weighted imaging, proton magnetic resonance spectroscopic imaging, and the three techniques combined. Acta Radiol 2008;49(5):602–610.
  • Mata et al. (2015) Mata, C., Walker, P., Oliver, A., Brunotte, F., Martí, J., Lalande, A.. Prostateanalyzer: web-based medical application for the management of prostate cancer using multiparametric mr images. Informatics for Health & Social Care 2015;:1–21.
  • Altman and Bland (1983) Altman, D.G., Bland, J.M.. Measurement in medicine: The analysis of method comparison studies. The Statistician 1983;32(3):307–317.
  • Bland and Altman (1986) Bland, J., Altman, D.. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;327(8476):307–310.
  • Rice (2006) Rice, J.. Mathematical Statistics and Data Analysis. Third Edition, Duxbury Advanced; 2006.
  • Rote (1991) Rote, G.. Computing the minimum hausdorff distance between two point sets on a line under translation. Inform Proces Letters 1991;38(3):123–127.
  • Guang-Zhong and Tianzi (2004) Guang-Zhong, Y., Tianzi, J.. Medical imaging and augmented reality: second international workshop, MIAR 2004; vol. 3150 of Lecture Notes in Computer Science. Springer-Verlag Berlin Heidelberg; 2004. ISBN 9783540228776.
  • Ghose et al. (2012) Ghose, S., Oliver, A., Martí, R., Lladó, X., Vilanova, J., Freixenet, J., et al. A survey of prostate segmentation methodologies in ultrasound, magnetic resonance and computed tomography images. Computer Methods and Programs in Biomedicine 2012;108(1):262 – 287.