Radiomic Feature Stability Analysis based on Probabilistic Segmentations

10/13/2019 · Christoph Haarburger et al.

Identifying image features that are robust with respect to segmentation variability and domain shift is a tough challenge in radiomics. So far, this problem has mainly been tackled in test-retest analyses. In this work we analyze radiomics feature stability based on probabilistic automated segmentation hypotheses. Based on a public lung cancer dataset, we generate an arbitrary number of plausible segmentations using a Probabilistic U-Net. From these segmentations, we extract a high number of plausible feature vectors for each lung tumor and analyze feature variance with respect to the segmentations. Our results suggest that there are groups of radiomic features that are more (e.g. statistics features) and less (e.g. gray-level size zone matrix features) robust against segmentation variability. Finally, we demonstrate that segmentation variance impacts the performance of a prognostic lung cancer survival model and propose a new and potentially more robust radiomics feature selection workflow.




1 Introduction

Radiomics has been increasingly applied to radiological and oncological image data [3, 11, 8]. One of the main problems with the use of radiomics lies in the curse of dimensionality: most studies are conducted on several hundred images, while thousands of image features are extracted. Several problems arise from this setup. Firstly, by extracting such high numbers of features, multicollinearity and overfitting may limit the reproducibility of radiomic signatures despite the use of feature selection algorithms [18, 6]. While image feature definitions are increasingly standardized [20, 17], further problems lie in differences in image reconstruction [13] and in how the segmentations from which the features are extracted are generated. In contrast to many claims [9, 19], most current radiomics pipelines are not truly quantitative, because segmentations are performed by humans and are thus subject to inter- and intra-rater variability [10]. We hypothesize that this limits the validity and reproducibility of radiomic signatures, even when they are based on expert segmentations.

Zwanenburg et al. [1] have already assessed the robustness of features with respect to image perturbations such as translation, rotation and noise addition. Aerts et al. [3] and Peerlings et al. [14] investigated feature stability in test-retest studies. In a recent review article [16], Traverso et al. concluded that there is currently no consensus on which features are optimal in terms of reproducibility. In order to quantify the degree to which radiomics features depend on the particular segmentation, we propose to perform a probabilistic automated segmentation that generates a set of plausible segmentations rather than one or a few manual segmentations by experts. Feature vectors are then extracted for each segmentation individually, in order to assess the robustness of individual features with respect to segmentation variability.

The set of plausible segmentations is generated by an extension of the U-Net [15], the Probabilistic U-Net (PU-Net) [12], which combines the U-Net with a conditional variational autoencoder (CVAE). With this architecture, plausible segmentations can be sampled from the latent space of the CVAE.
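As a toy illustration of this sampling mechanism, the sketch below draws latent codes from a Gaussian prior, broadcasts them over the spatial grid, and fuses them with U-Net feature maps via a 1x1 convolution. All shapes, function names and the final thresholding are simplified assumptions for illustration, not the actual PU-Net implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_segmentations(unet_features, prior_mu, prior_logvar, w, n_samples=5):
    """Toy sketch of PU-Net-style sampling (hypothetical shapes, not the real model).

    unet_features: (C, H, W) feature map from the U-Net decoder
    prior_mu, prior_logvar: (L,) parameters of the CVAE prior over the latent space
    w: (C + L,) weights of a 1x1 "fusion" convolution producing one logit map
    """
    C, H, W = unet_features.shape
    masks = []
    for _ in range(n_samples):
        # Draw a latent code z ~ N(mu, sigma^2) from the prior.
        z = prior_mu + np.exp(0.5 * prior_logvar) * rng.standard_normal(prior_mu.shape)
        # Broadcast z over the spatial grid and concatenate with the U-Net features.
        z_map = np.broadcast_to(z[:, None, None], (z.size, H, W))
        fused = np.concatenate([unet_features, z_map], axis=0)  # (C + L, H, W)
        # A 1x1 convolution is a per-pixel dot product with the fusion weights.
        logits = np.tensordot(w, fused, axes=([0], [0]))        # (H, W)
        masks.append((logits > 0).astype(np.uint8))             # binary hypothesis
    return masks

# Each call yields a different plausible binary mask for the same image.
feats = rng.standard_normal((8, 16, 16))
hyps = sample_segmentations(feats, np.zeros(6), np.zeros(6), rng.standard_normal(14))
```

Because only the latent code changes between samples, the hypotheses differ in a spatially coherent way rather than as independent per-pixel noise.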

Based on these, we extract a high number of radiomics features and assess feature variance with respect to a set of plausible segmentation masks. We identify groups of features that are invariant with respect to the particular segmentation and others that depend on it more heavily. Furthermore, we show that segmentation variance influences the performance of a prognostic survival model on a public lung cancer dataset.

2 Material and Methods

2.1 Image Data

All evaluations are performed on two publicly available lung cancer datasets: the LIDC-IDRI dataset [5, 4] is used to train the PU-Net, whereas feature stability is assessed on the Maastro Lung1 dataset [3, 2]; both are hosted at The Cancer Imaging Archive (TCIA) [7]. Expert segmentations for all 422 cases of the Lung1 dataset are also publicly available. An example image with a corresponding expert segmentation is depicted in Fig. 1(a). Given the CT scans, expert annotations and right-censored survival data, the Maastro Lung1 dataset has been used for radiomics-based lung cancer survival analysis [3, 18].

(a) Ground Truth
(b) Segmentation
(c) Segmentation
(d) Segmentation
(e) Segmentation
(f) Segmentation
Figure 1: Ground truth (a) and examples of corresponding probabilistic segmentations (b – f). The Dice scores refer to the ground truth.

2.2 Segmentation using Probabilistic U-Net

The PU-Net [12] is an extension of the popular U-Net architecture [15] that models the distribution of plausible segmentations using a CVAE. By sampling from a low-dimensional latent space, an arbitrary number of plausible segmentation hypotheses can be generated, which can be interpreted as repeatedly asking a human expert to perform a manual segmentation. We trained a publicly available PU-Net implementation with the latent space dimensionality suggested in the original PU-Net publication [12]. The network operates on 2D axial slices.

After training the PU-Net on the LIDC-IDRI dataset, 1000 plausible 2D segmentation masks were generated for every axial slice of the Maastro Lung1 dataset. Unfortunately, for 73 cases the probabilistic segmentation failed on several slices, producing no segmentation at all. By visual inspection we were not able to identify any pattern explaining these failures. These cases were excluded from this study.

2.3 Feature Extraction

Using the PU-Net, we generated up to 1000 plausible segmentations for each slice. This initial set was then reduced to the unique segmentations, resulting in dozens to a few hundred segmentations per slice. For this reduced set, we calculated the Dice score

DSC(A, B) = 2 |A ∩ B| / (|A| + |B|)

between each probabilistic segmentation A and the ground truth B, and sampled 25 segmentations uniformly with respect to their Dice scores. This step is needed to ensure that the segmentations actually differ, because even in the reduced set generated by the PU-Net, many segmentations are almost identical. The number of 25 segmentations was chosen based on visual inspection of segmentation diversity and was not optimized further.
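The Dice computation and the selection of 25 diverse segmentations could be sketched as follows. The per-bin sampling strategy is our assumption; the text only states that the masks are sampled uniformly with respect to their Dice scores to the ground truth.

```python
import numpy as np

def dice(a, b):
    """Dice score between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def sample_by_dice(masks, ground_truth, k=25, seed=0):
    """Pick k masks roughly uniformly over their Dice score to the ground truth.

    Sketch of the selection step described above; the binning strategy is an
    illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    scores = np.array([dice(m, ground_truth) for m in masks])
    # Partition the observed Dice range into k bins and draw one mask per bin
    # where possible, so the selection covers the whole range of variability.
    edges = np.linspace(scores.min(), scores.max() + 1e-9, k + 1)
    chosen = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((scores >= lo) & (scores < hi))[0]
        if idx.size:
            chosen.append(int(rng.choice(idx)))
    # Top up with random distinct masks if some bins were empty.
    while len(chosen) < min(k, len(masks)):
        i = int(rng.integers(len(masks)))
        if i not in chosen:
            chosen.append(i)
    return [masks[i] for i in chosen]
```

Sampling uniformly over Dice rather than uniformly over masks prevents the many near-identical hypotheses from dominating the selected set.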

Next, for each nodule, 25 feature vectors, based on the 25 segmentations, were extracted using the PyRadiomics framework [17] (Version 2.2.0). Specifically, we extracted:

  • 18 statistics features

  • 15 shape features

  • 22 gray level co-occurrence matrix (GLCM) features

  • 16 gray level size zone matrix (GLSZM) features

  • 16 gray level run length matrix (GLRLM) features

  • 5 neighboring gray tone difference matrix (NGTDM) features

For feature extraction, all images were resampled to an isotropic voxel spacing of 1 mm, and a bin width of 25 was used for gray value binning. Moreover, we extracted the image features not only from the original CT images but also from wavelet-transformed images, using the standard wavelet transforms implemented in PyRadiomics.
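The fixed-bin-width discretization and a few of the first-order statistics features can be illustrated with a minimal numpy stand-in. The actual extraction uses PyRadiomics, so the feature names and exact definitions here follow common conventions rather than the library's code.

```python
import numpy as np

def discretize(img, bin_width=25.0):
    """Fixed-bin-width gray value discretization (binWidth = 25, as in the text)."""
    return np.floor(img / bin_width).astype(np.int64)

def first_order_features(img, mask, bin_width=25.0):
    """A few illustrative first-order ("statistics") features inside a mask.

    Minimal stand-in for a radiomics first-order feature class; not the
    PyRadiomics implementation.
    """
    vals = img[mask > 0].astype(np.float64)
    # Histogram over discretized gray levels (shifted so bins are non-negative).
    counts = np.bincount(discretize(vals - vals.min(), bin_width))
    p = counts[counts > 0] / vals.size
    return {
        "Mean": float(vals.mean()),
        "Variance": float(vals.var()),
        "Energy": float(np.sum(vals ** 2)),
        "Entropy": float(-np.sum(p * np.log2(p))),
    }
```

Because each of the 25 masks selects a slightly different set of voxels, re-running such an extraction per mask directly yields the per-feature distributions analyzed below.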

3 Results

For 73 of the 422 patients, the PU-Net did not produce valid segmentation hypotheses but produced empty masks, as explained in Section 2.2, which led to the exclusion of these cases from our study.

Feature stability was assessed by calculating the intraclass correlation coefficient (ICC) for each feature across all 25 segmentations. ICCs grouped by feature category are provided in Fig. 2. In Fig. 3, we provide the same ICCs grouped by image transform (no transform vs. wavelet transforms).
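A minimal sketch of such an ICC computation, assuming a one-way random-effects model (ICC(1,1)) with the 25 segmentation hypotheses playing the role of raters; the paper does not state which ICC variant was used.

```python
import numpy as np

def icc_1(scores):
    """One-way random-effects ICC(1,1) for an (n_subjects, k_raters) matrix.

    Here the "raters" would be the 25 segmentation hypotheses; the specific
    ICC variant is an illustrative assumption.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n, k = scores.shape
    grand_mean = scores.mean()
    subject_means = scores.mean(axis=1)
    # Between-subject and within-subject mean squares (one-way ANOVA).
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

A feature whose value is driven by the tumor rather than by the particular mask yields an ICC near 1; a feature dominated by segmentation noise yields an ICC near 0.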

Figure 2: ICCs for all features clustered by feature category.
Figure 3: ICCs for all features clustered by image transform.

In Fig. 2 we can see that most statistics, shape, GLCM and NGTDM features have a very high ICC (above 0.9). Many GLRLM and especially GLSZM features have considerably lower ICCs, which can be as low as 0.2. If the features are grouped by image transform, as in Fig. 3, it becomes obvious that wavelet features are generally subject to a higher segmentation dependence than features extracted from the raw CT images. Overall, 28.7% of all features had an ICC below 0.9.

In order to demonstrate the relation between the prognostic value and the stability of a feature, Fig. 4 plots, for each feature, the univariate concordance index (cindex) against the stability rank. The cindex is a widely used performance measure in survival analysis; for a set P of comparable patient pairs (i, j), i.e. pairs in which patient i experienced an event before patient j's observed time, it is defined as

c = (1 / |P|) Σ_{(i,j) ∈ P} 1(r_i > r_j),

where r denotes the predicted risk. The stability rank is defined as a descending ranking of all features based on ICC, such that the feature with the highest ICC has a stability rank of 1.
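The pairwise definition above can be implemented directly; this is a straightforward O(n²) sketch of Harrell's cindex, in which the 0.5 credit for risk ties is one common convention rather than necessarily the paper's choice.

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when the shorter observed time ends in an
    event; it is concordant when the model assigns that patient the higher
    risk. Ties in risk count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:  # i's event makes the pair comparable
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable
```

A cindex of 0.5 corresponds to random risk ordering and 1.0 to a perfect ranking, which is why the reported values around 0.57 indicate modest but real prognostic signal.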

Figure 4: Stability rank (x-axis) and corresponding concordance index (cindex) for all features.

The last experiment is based on the radiomic signature by Aerts et al. [3], a Cox model using the four features identified therein. We computed this signature for all 25 segmentations and calculated the corresponding cindices, which are depicted as a histogram in Fig. 5. The cindices vary between 0.569 and 0.577. In comparison, using the expert segmentations, a cindex of 0.574 is achieved.

Figure 5: Histogram of cindices using a Cox model based on the Aerts signature [3] over 25 probabilistic segmentations.

4 Discussion

We have trained a probabilistic segmentation algorithm that provides segmentation hypotheses that can be used for extracting radiomics features. Based on the resulting feature vectors, we have analyzed feature stability with respect to varying segmentation masks. The results show that there are groups of radiomics features that are subject to higher and lower variance across segmentations, respectively. This is in line with other works in which the stability of feature vectors originating from multiple manual expert segmentations was evaluated [3, 14]. Moreover, Fig. 4 shows that variance in the segmentations carries over to a prognostic model. However, the variance of the cindex across segmentations is relatively small, indicating that the signature by Aerts et al. is relatively robust against segmentation variance. Our cindices are generally lower than in other publications because 73 of the 422 patients had to be excluded, as the PU-Net failed to produce segmentation hypotheses in these cases.

There is currently no consensus on which ICC cutoff should be used to exclude features from further analyses [16]. However, assuming a cutoff of 0.9, about one third of all features in our analysis could be discarded. Thus, in a standard pipeline for radiomic signature development, the curse of dimensionality and multicollinearity could be considerably alleviated, which may lead to radiomic signatures that are more robust and reproducible. Moreover, feature scores could be averaged over segmentation mask hypotheses to further improve robustness.

Based on these findings, rather than “just” extracting radiomics features from a single expert segmentation, we envision that a future radiomic signature development pipeline could be composed of the following steps:

  1. Train a probabilistic segmentation algorithm based on the expert segmentation

  2. Generate plausible segmentations for each case

  3. For each feature, calculate ICC with respect to the segmentations

  4. Discard features that are subject to low ICC

  5. For the remaining features, average feature scores over probabilistic segmentations

  6. Run “standard” radiomics pipeline (feature selection, model fit, etc.)
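Steps 3 – 5 of this workflow can be sketched on a (cases × segmentations × features) array; the ICC(1,1) formula and the 0.9 cutoff are illustrative assumptions, since the paper notes there is no consensus cutoff.

```python
import numpy as np

def stable_averaged_features(features, icc_cutoff=0.9):
    """Filter features by ICC across segmentations, then average over them.

    features: (n_cases, n_segs, n_feats) array of extracted feature values.
    Returns the averaged (n_cases, n_kept) matrix and the kept feature indices.
    """
    n_cases, n_segs, n_feats = features.shape
    keep = []
    for f in range(n_feats):
        scores = features[:, :, f]                      # (n_cases, n_segs)
        grand = scores.mean()
        case_means = scores.mean(axis=1)
        # One-way random-effects ICC(1,1), segmentations acting as raters.
        msb = n_segs * np.sum((case_means - grand) ** 2) / (n_cases - 1)
        msw = np.sum((scores - case_means[:, None]) ** 2) / (n_cases * (n_segs - 1))
        icc = (msb - msw) / (msb + (n_segs - 1) * msw)
        if icc >= icc_cutoff:
            keep.append(f)
    # Average the surviving features over the segmentation hypotheses (step 5).
    return features[:, :, keep].mean(axis=1), keep
```

The reduced, averaged feature matrix would then feed into the standard radiomics pipeline of step 6 (feature selection, model fit, etc.).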

Our study has several limitations. First, the dataset used for feature analysis originates from a single scanner using a single reconstruction algorithm. Thus, variability arising from protocol and device differences, as investigated in [6], is not considered here. Furthermore, the publicly available expert segmentations are prone to errors, which might carry over to the generated probabilistic segmentations. The PU-Net segmentations themselves also have several shortcomings: segmentation is only performed on 2D axial slices; training was partly unstable, producing many near-identical segmentation hypotheses that had to be filtered out; and for 73 cases the automatic segmentation failed completely, producing empty masks. These cases had to be excluded from our study, which limits comparability to other works on the same dataset. Finally, our findings are based on a single dataset from a single modality. In future work, it is desirable to conduct the same experiments on other cancer types and imaging devices to assess feature robustness in a more general sense.

5 Conclusion

Using a set of plausible segmentation hypotheses generated by a PU-Net segmentation algorithm, we analyzed the variance of radiomic features with respect to varying segmentations, showing that there are groups of image features with different degrees of robustness. Furthermore, we showed that segmentation variance impacts a radiomics survival model on a public lung cancer dataset.


  • [1] A. Zwanenburg, S. Leger, et al. (2019) Assessing robustness of radiomic features by image perturbation. Scientific Reports 9. Cited by: §1.
  • [2] H. J. W. L. Aerts, E. Rios Velazquez, R. T. H. Leijenaar, C. Parmar, P. Grossmann, S. Carvalho, and Philippe. Lambin (2015) Data from NSCLC-radiomics. In The Cancer Imaging Archive, External Links: Document Cited by: §2.1.
  • [3] H. J. W. L. Aerts, E. R. Velazquez, R. T. H. Leijenaar, C. Parmar, P. Grossmann, S. Cavalho, J. Bussink, R. Monshouwer, B. Haibe-Kains, D. Rietveld, F. Hoebers, M. M. Rietbergen, C. R. Leemans, A. Dekker, J. Quackenbush, R. J. Gillies, and P. Lambin (2014-06) Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications 5. External Links: Document, ISSN 2041-1723 Cited by: §1, §1, §2.1, Figure 5, §3, §4.
  • [4] S.G. Armato et al. (2015) Data from lidc-idri. The Cancer Imaging Archive. External Links: Document Cited by: §2.1.
  • [5] S. G. Armato et al. (2011-01) The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics 38 (2), pp. 915–931. External Links: Document, Link Cited by: §2.1.
  • [6] R. Berenguer, M. del Rosario Pastor-Juan, J. Canales-Vázquez, M. Castro-García, M. V. Villas, F. M. Legorburo, and S. Sabater (2018-08) Radiomics of CT features may be nonreproducible and redundant: influence of CT acquisition parameters. Radiology 288 (2), pp. 407–415. External Links: Document, Link Cited by: §1, §4.
  • [7] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, L. Tarbox, and F. Prior (2013-07) The cancer imaging archive (TCIA): maintaining and operating a public information repository. Journal of Digital Imaging 26 (6), pp. 1045–1057. External Links: Document, Link Cited by: §2.1.
  • [8] E. de la Rosa, D. M. Sima, T. V. Vyvere, J. S. Kirschke, and B. Menze (2019-04) A radiomics approach to traumatic brain injury prediction in CT scans. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), External Links: Document, Link Cited by: §1.
  • [9] R. J. Gillies, P. E. Kinahan, and H. Hricak (2016) Radiomics: images are more than pictures, they are data. Radiology 278 (2), pp. 563–577. Note: PMID: 26579733 External Links: Document, Cited by: §1.
  • [10] L. Joskowicz, D. Cohen, N. Caplan, and J. Sosna (2019-03-01) Inter-observer variability of manual contour delineation of structures in ct. European Radiology 29 (3), pp. 1391–1399. External Links: Document, ISSN 1432-1084, Link Cited by: §1.
  • [11] P. Kickingereder, S. Burth, A. Wick, M. Götz, O. Eidel, H. Schlemmer, K. H. Maier-Hein, W. Wick, M. Bendszus, A. Radbruch, and D. Bonekamp (2016-09) Radiomic profiling of glioblastoma: identifying an imaging predictor of patient survival with improved performance over established clinical and radiologic risk models. Radiology 280 (3), pp. 880–889. External Links: Document Cited by: §1.
  • [12] S. A. A. Kohl, B. Romera-Paredes, C. Meyer, J. D. Fauw, J. R. Ledsam, K. H. Maier-Hein, S. M. A. Eslami, D. J. Rezende, and O. Ronneberger (2018) A Probabilistic U-Net for Segmentation of Ambiguous Images. External Links: arXiv:1806.05034 Cited by: §1, §2.2.
  • [13] M. Meyer, J. Ronald, F. Vernuccio, R. C. Nelson, J. C. Ramirez-Giraldo, J. Solomon, B. N. Patel, E. Samei, and D. Marin (2019-10) Reproducibility of CT radiomic features within the same patient: influence of radiation dose and CT reconstruction settings. Radiology. External Links: Document, Link Cited by: §1.
  • [14] J. Peerlings, H. C. Woodruff, J. M. Winfield, A. Ibrahim, B. E. V. Beers, A. Heerschap, A. Jackson, J. E. Wildberger, F. M. Mottaghy, N. M. DeSouza, and P. Lambin (2019-03) Stability of radiomics features in apparent diffusion coefficient maps from a multi-centre test-retest trial. Scientific Reports 9 (1). External Links: Document, Link Cited by: §1, §4.
  • [15] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI, pp. 234–241. Cited by: §1, §2.2.
  • [16] A. Traverso, L. Wee, A. Dekker, and R. Gillies (2018-11) Repeatability and reproducibility of radiomic features: a systematic review. International Journal of Radiation Oncology 102 (4), pp. 1143–1158. External Links: Document, Link Cited by: §1, §4.
  • [17] J. J.M. van Griethuysen, A. Fedorov, C. Parmar, A. Hosny, N. Aucoin, V. Narayan, R. G.H. Beets-Tan, J. Fillion-Robin, S. Pieper, and H. J.W.L. Aerts (2017-10) Computational radiomics system to decode the radiographic phenotype. Cancer Research 77 (21), pp. e104–e107. External Links: Document Cited by: §1, §2.3.
  • [18] M. L. Welch, C. McIntosh, B. Haibe-Kains, M. F. Milosevic, L. Wee, A. Dekker, S. H. Huang, T. G. Purdie, B. O’Sullivan, H. J.W.L. Aerts, and D. A. Jaffray (2018) Vulnerabilities of radiomic signature development: the need for safeguards. Radiotherapy and Oncology. External Links: Document, ISSN 0167-8140 Cited by: §1, §2.1.
  • [19] S. S. F. Yip and H. J. W. L. Aerts (2016) Applications and limitations of radiomics. Physics in Medicine and Biology 61 (13), pp. R150. Cited by: §1.
  • [20] A. Zwanenburg, S. Leger, M. Vallières, and S. Löck (2016-12) Image biomarker standardisation initiative - feature definitions. External Links: 1612.07003 Cited by: §1.