
Reproducible radiomics through automated machine learning validated on twelve clinical applications

Radiomics uses quantitative medical imaging features to predict clinical outcomes. While many radiomics methods have been described in the literature, these are generally designed for a single application. The aim of this study is to generalize radiomics across applications by proposing a framework to automatically construct and optimize the radiomics workflow per application. To this end, we formulate radiomics as a modular workflow, consisting of several components: image and segmentation preprocessing, feature extraction, feature and sample preprocessing, and machine learning. For each component, a collection of common algorithms is included. To optimize the workflow per application, we employ automated machine learning using a random search and ensembling. We evaluate our method in twelve different clinical applications, resulting in the following area under the curves: 1) liposarcoma (0.83); 2) desmoid-type fibromatosis (0.82); 3) primary liver tumors (0.81); 4) gastrointestinal stromal tumors (0.77); 5) colorectal liver metastases (0.68); 6) melanoma metastases (0.51); 7) hepatocellular carcinoma (0.75); 8) mesenteric fibrosis (0.81); 9) prostate cancer (0.72); 10) glioma (0.70); 11) Alzheimer's disease (0.87); and 12) head and neck cancer (0.84). Concluding, our method fully automatically constructs and optimizes the radiomics workflow, thereby streamlining the search for radiomics biomarkers in new applications. To facilitate reproducibility and future research, we publicly release six datasets, the software implementation of our framework (open-source), and the code to reproduce this study.





1 Introduction

In the last decades, there has been a paradigm shift in health care, moving from a reactive, one-size-fits-all approach towards a more proactive, personalized approach (Hood and Friend, 2011; Chan and Ginsburg, 2011; Hamburg and Collins, 2010). To aid in this process, personalized medicine generally involves clinical decision support systems such as biomarkers, which relate specific patient characteristics to some biological state, outcome, or condition. To develop such biomarkers, medical imaging has gained an increasingly important role (Hood and Friend, 2011; O’Connor et al., 2017). Currently, in clinical practice, medical imaging is assessed by radiologists, an assessment that is generally qualitative and observer dependent. Therefore, there is a need for quantitative, objective biomarkers to leverage the full potential of medical imaging for personalized medicine and improve patient care.

To this end, machine learning, both using conventional and deep learning methods, has proven highly successful for medical image classification and has thus become the de facto standard. Within the field of radiology, the term “radiomics” has been coined to describe the use of a large number of quantitative medical imaging features in combination with (typically conventional) machine learning to create biomarkers (Lambin et al., 2012). Predictions for example relate to diagnosis, prognosis, histology, treatment planning (e.g. chemotherapy, radiotherapy, immunotherapy), treatment response, drug usage, surgery, and genetic mutations. The rise in popularity of radiomics has resulted in a large number of papers and a wide variety of methods (Yip and Aerts, 2016; Rizzo et al., 2018; Traverso et al., 2018; Sollini et al., 2019; Afshar et al., 2019; Parekh and Jacobs, 2019; Bodalal et al., 2019; Song et al., 2020; Avanzo et al., 2020; Aerts, 2016; Guiot et al., 2021).

In a new radiomics application, finding the optimal method out of the wide range of available options has to be done manually through a heuristic trial-and-error process. This process has several disadvantages, as it: 1) is time-consuming; 2) requires expert knowledge; 3) does not guarantee that an optimal solution is found; 4) negatively affects the reproducibility; 5) has a high risk of overfitting when not carefully conducted (Hosseini et al., 2020; Song et al., 2020); and 6) limits the translation to clinical practice (Sollini et al., 2019).

The aim of this study is to streamline radiomics research, facilitate radiomics’ reproducibility, and simplify its application by proposing a framework to fully automatically construct and optimize the radiomics workflow per application. Most published radiomics methods roughly consist of the same steps: image segmentation, preprocessing, feature extraction, and classification. Hence, as radiomics methods show substantial overlap, we hypothesize that it should be possible to automatically find the optimal radiomics model in a new clinical application by collecting numerous methods in one single framework and systematically comparing and combining all included components.

To optimize the radiomics workflow, we exploit recent advances in automated machine learning (AutoML) (Hutter et al., 2019). We define a radiomics workflow as a specific combination of algorithms and their associated hyperparameters, i.e., parameters that need to be set before the actual learning step. To create a modular design, we standardize the components of radiomics workflows, i.e., separating the workflows into components with fixed inputs, functionality, and outputs. For each component, we include a large number of algorithms and their associated hyperparameters. We focus on conventional radiomics pipelines, i.e., using conventional machine learning, for the following reasons: 1) radiomics methods are quick to train, hence AutoML is feasible to apply; 2) the radiomics search space is relatively clear, as radiomics workflows typically follow the same steps, further enhancing the feasibility of AutoML; 3) as there is a large number of radiomics papers, the impact of such a method is potentially large; and 4) radiomics is also suitable for small datasets, which is relevant for (rare) oncological applications (Bodalal et al., 2019; Song et al., 2020).

We describe the construction of a radiomics workflow per application as a Combined Algorithm Selection and Hyperparameter (CASH) optimization problem (Thornton et al., 2013), in which we include both the choice of algorithms and their associated hyperparameters. The CASH problem is solved through a brute-force randomized search, identifying the most promising workflows. To boost performance and stability, an ensemble is taken over the most promising workflows to combine them into a single model. Through this use of adaptive workflow optimization, our framework automatically constructs and optimizes the radiomics workflow for each application.
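The combination of a brute-force random search over workflow configurations with ensembling of the best candidates can be sketched as follows. This is an illustrative scikit-learn sketch, not the actual WORC search space: the algorithm choices (SVM vs. random forest, optional feature selection) and hyperparameter ranges are stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=100, n_features=50, random_state=42)

def sample_workflow():
    """Randomly sample one workflow: algorithm choices plus hyperparameters."""
    # Selector hyperparameter: choose the classification algorithm
    if rng.random() < 0.5:
        clf = SVC(C=10 ** rng.uniform(-2, 2), random_state=42)
    else:
        clf = RandomForestClassifier(n_estimators=int(rng.integers(10, 100)),
                                     random_state=42)
    steps = [("scale", StandardScaler())]
    # Activator hyperparameter: optionally include a feature selection step
    if rng.random() < 0.5:
        steps.append(("select", SelectKBest(k=int(rng.integers(5, 40)))))
    steps.append(("clf", clf))
    return Pipeline(steps)

# Brute-force random search: evaluate each sampled workflow with cross-validation
workflows = [sample_workflow() for _ in range(20)]
scores = [cross_val_score(wf, X, y, cv=5, scoring="f1_weighted").mean()
          for wf in workflows]

# Keep the most promising workflows for the ensemble
best = [workflows[i] for i in np.argsort(scores)[::-1][:5]]
```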

To validate our approach and evaluate its generalizability, we evaluate our framework on twelve different clinical applications using three publicly available datasets and nine in-house datasets. To facilitate reproducibility, six of the in-house datasets with data of in total 930 patients are publicly released with this paper (Starmans et al., 2021c). To further facilitate reproducibility, we have made the software implementation of our method, and the code to perform our experiments on all datasets open-source (Starmans et al., 2018a; Starmans, 2021).

1.1 Background: Radiomics

To outline the context of this study, we here present some background on typical radiomics studies. Generally, a radiomics study can be seen as a collection of various steps: data acquisition and preparation, segmentation, feature extraction, and data mining (Starmans et al., 2020c). In this study, we consider the data, i.e., the images, ground truth labels, and segmentations, to be given; data acquisition and segmentation algorithms are therefore outside of the scope of this study.

First, radiomics workflows commonly start with preprocessing of the images and the segmentations to compensate for undesired variations in the data. For example, as radiomics features may be sensitive to image acquisition variations, harmonizing the images may improve the repeatability, reproducibility, and overall performance (Traverso et al., 2018). Examples of preprocessing steps are normalization of the image intensities to a similar scale, or resampling all images (and segmentations) to the same voxel spacing.
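The two example preprocessing steps, intensity normalization and resampling to a common voxel spacing, can be sketched with NumPy and SciPy. This is an illustrative sketch, not the WORC implementation; the target spacing of 1 mm isotropic is an assumed example.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(image, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Normalize intensities via z-scoring and resample to a common voxel spacing."""
    image = (image - image.mean()) / image.std()           # intensity normalization
    factors = [s / ns for s, ns in zip(spacing, new_spacing)]
    return zoom(image, factors, order=1)                   # linear interpolation

# A toy volume with 5 mm slice thickness and 1 mm in-plane pixel spacing
volume = np.random.default_rng(0).normal(100.0, 20.0, size=(16, 64, 64))
out = preprocess(volume, spacing=(5.0, 1.0, 1.0))
```

After resampling, the 16 thick slices become 80 slices of 1 mm, so downstream features are computed on a consistent grid across patients.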

Second, quantitative image features are computationally extracted. As most radiomics applications are in oncology, feature extraction algorithms generally focus on describing properties of a specific region of interest, e.g., a tumor, and require a segmentation. Features are typically split into three groups (Parekh and Jacobs, 2016; Zwanenburg et al., 2020): 1) first-order or histogram features, quantifying intensity distributions; 2) morphology features, quantifying shape; and 3) higher-order or texture features, quantifying spatial distributions of intensities or specific patterns. Typically, radiomics studies extract hundreds or thousands of features, but eliminate a large part through feature selection in the data mining step. Many open-source toolboxes for radiomics feature extraction exist, such as MaZda (Szczypiński et al., 2009), CGITA (Fang et al., 2014), CERR (Apte et al., 2018), IBEX (Zhang et al., 2015), PyRadiomics (van Griethuysen et al., 2017), CaPTk (Rathore et al., 2018), LIFEx (Nioche et al., 2018), and RaCat (Pfaehler et al., 2019). A comprehensive overview of radiomics toolboxes can be found in Song et al. (2020).

Lastly, the data mining component may itself consist of a combination of various components: 1) feature imputation; 2) feature scaling; 3) feature selection; 4) dimensionality reduction; 5) resampling; and 6) (machine learning) algorithms to find relationships between the remaining features and the clinical labels or outcomes. While these methods are often seen as one component, i.e., the data mining component, we split the data mining step into separate components (subsection 2.2).

2 Methods

This study focuses on binary classification problems, as these are most common in radiomics (Song et al., 2020).

2.1 Adaptive workflow optimization

The aim of our framework is to automatically construct and optimize the radiomics workflow out of a large number of algorithms and their associated hyperparameters. To this end, we have identified three key requirements. First, as the optimal combination of algorithms may vary per application, our optimization strategy should adapt the workflow per application. Second, while model selection is typically performed before hyperparameter tuning, it has been shown that these two problems are not independent (Thornton et al., 2013). Thus, combined optimization is required. Third, to prevent overfitting, all optimization should be performed on a training dataset and thereby independent from the test dataset (Hutter et al., 2019; Hosseini et al., 2020; Song et al., 2020). As manual model selection and hyperparameter tuning is not feasible in a large solution space and not reproducible, all optimization should be automatic.

2.1.1 The Combined Algorithm Selection and Hyperparameter (CASH) optimization problem

To address the three identified key requirements, we propose to formulate the complete radiomics workflow as a Combined Algorithm Selection and Hyperparameter (CASH) optimization problem, which has previously been defined in AutoML for machine learning model optimization (Thornton et al., 2013). For a single algorithm, the associated hyperparameter space consists of all possible values of all the associated hyperparameters. In machine learning, given a dataset $\mathcal{D}$ consisting of features and ground truth labels for objects or samples, and a set of algorithms $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$ with associated hyperparameter spaces $\Lambda^{(1)}, \ldots, \Lambda^{(R)}$, the CASH problem is to find the algorithm $A^{(j)}$ and associated hyperparameter set $\lambda \in \Lambda^{(j)}$ that minimize the loss $\mathcal{L}$:

$$A^{*}_{\lambda^{*}} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\, \lambda \in \Lambda^{(j)}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\text{train}}, \mathcal{D}^{(i)}_{\text{valid}}\right), \tag{1}$$

where a cross-validation with $k$ iterations is used to define subsets of the full dataset $\mathcal{D}$ for training ($\mathcal{D}^{(i)}_{\text{train}}$) and validation ($\mathcal{D}^{(i)}_{\text{valid}}$). In order to combine model selection and hyperparameter optimization, the problem can be reformulated as a pure hyperparameter optimization problem by introducing a new hyperparameter $\lambda_r$ that selects between algorithms: $\Lambda = \Lambda^{(1)} \cup \ldots \cup \Lambda^{(R)} \cup \{\lambda_r\}$ (Thornton et al., 2013). Thus, $\lambda_r$ defines which algorithm from $\mathcal{A}$ and which associated hyperparameter space $\Lambda^{(j)}$ are used. This results in:

$$\lambda^{*} \in \operatorname*{argmin}_{\lambda \in \Lambda} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(A_{\lambda}, \mathcal{D}^{(i)}_{\text{train}}, \mathcal{D}^{(i)}_{\text{valid}}\right). \tag{2}$$

We extend the CASH problem to the complete radiomics workflow, consisting of various components. The parameters of all algorithms are treated as hyperparameters. Furthermore, instead of introducing a single hyperparameter to select between algorithms, we define multiple algorithm selection hyperparameters. Two categories are distinguished: 1) for optional components, an activator hyperparameter is introduced to determine whether the component is actually used or not; and 2) for mandatory components, an integer selector hyperparameter is introduced to select one of the available algorithms. Optional components that contain multiple algorithms have both an activator and a selector hyperparameter. We thus reformulate CASH for a collection of algorithm sets $\boldsymbol{\mathcal{A}} = \{\mathcal{A}_1, \ldots, \mathcal{A}_M\}$ and the collection of associated hyperparameter spaces $\boldsymbol{\Lambda} = \{\Lambda_1, \ldots, \Lambda_M\}$. Including the activator and selector model selection parameters within the hyperparameter collections, similar to Equation 2, this results in:

$$\boldsymbol{\lambda}^{*} \in \operatorname*{argmin}_{\boldsymbol{\lambda} \in \boldsymbol{\Lambda}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(A_{\boldsymbol{\lambda}}, \mathcal{D}^{(i)}_{\text{train}}, \mathcal{D}^{(i)}_{\text{valid}}\right). \tag{3}$$
A schematic overview of the algorithm and hyperparameter search space is shown in Figure 1. The resulting framework is coined WORC (Workflow for Optimal Radiomics Classification). Including new algorithms and hyperparameters in this reformulation is straightforward, as these can simply be added to the collection of algorithm sets and the collection of hyperparameter spaces, respectively.

Figure 1:

Schematic overview of the workflow search space in our framework. The search space consists of various sequential sets of algorithms, where each algorithm may include various hyperparameters, as indicated by the leaves in the trees. An example of a workflow, i.e., a specific combination of algorithms and parameters, is indicated by the gray nodes. Abbreviations: AdaBoost: adaptive boosting; ADASYN: adaptive synthetic sampling; KNN: k-nearest neighbor; GLCM: gray level co-occurrence matrix; SMOTE: synthetic minority oversampling technique; SVM: support vector machine.

As a loss function $\mathcal{L}$, we use the weighted F1-score, which is the harmonic mean of precision and recall, and thus a class-balanced performance metric:

$$F1_{\text{weighted}} = \sum_{c=1}^{C} \frac{n_c}{N} \cdot \frac{2 \cdot \text{precision}_c \cdot \text{recall}_c}{\text{precision}_c + \text{recall}_c},$$

where $C = 2$ the number of classes for binary classification, $n_c$ the number of samples of class $c$, $N$ the total number of samples, and $\text{precision}_c$ and $\text{recall}_c$ the precision and recall of class $c$, respectively.
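As an illustration, the weighted F1-score can be computed by hand from the per-class precision and recall and checked against scikit-learn's implementation, which applies exactly this class-prevalence weighting (the toy labels below are made up):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# Manual computation: per-class F1, weighted by class prevalence n_c / N
f1_manual = 0.0
for c in (0, 1):
    p = precision_score(y_true, y_pred, pos_label=c)
    r = recall_score(y_true, y_pred, pos_label=c)
    n_c = (y_true == c).sum()
    f1_manual += (n_c / len(y_true)) * (2 * p * r / (p + r))

f1_sklearn = f1_score(y_true, y_pred, average="weighted")
```

Here class 0 has F1 = 2/3 with prevalence 3/8 and class 1 has F1 = 4/5 with prevalence 5/8, so both computations give 0.75.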

As optimization strategy, we use a straightforward random search algorithm, as it is efficient and often performs well (Bergstra and Bengio, 2012). In this random search, workflows are randomly sampled from the search space, and their validation $F1_{\text{weighted}}$ scores are calculated.

2.1.2 Ensembling

In radiomics studies showing the performance of multiple approaches there is often not a clear winner: many workflows generally have similar predictive accuracy. However, despite having similar overall accuracies, the actual prediction for an individual sample may considerably vary per workflow. Moreover, due to the CASH optimization, the best performing solution is likely to overfit. Hence, by combining different workflows in an ensemble, the performance and generalizability of radiomics models may be improved (Feurer et al., 2019).

Furthermore, ensembling may serve as a form of regularization, as local minima in the optimization are balanced by the other solutions in the ensemble. When repeating the optimization, due to the randomness of the search, which single workflow performs best, and thus the predictions per sample, may vary. This especially occurs when using a small number of random searches. An ensemble may therefore lead to a more stable solution of the random search.

Therefore, instead of selecting the single best workflow, we propose to use an ensemble. Various ensembling algorithms have been proposed in the literature (Zhang and Ma, 2012). Optimizing the ensemble construction on the training dataset may in itself lead to overfitting. Thus, we propose a simple approach of combining a fixed number of the best performing workflows by averaging their predictions (i.e., the posterior probabilities for binary classification). The workflows are ranked based on their mean $F1_{\text{weighted}}$ on the validation datasets.
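Averaging the posterior probabilities of the top-ranked workflows can be sketched in a few lines (the probabilities below are made-up example values):

```python
import numpy as np

# Posterior probabilities of the positive class for 4 test samples,
# predicted by 3 of the best-ranked workflows
posteriors = np.array([
    [0.9, 0.4, 0.2, 0.6],   # workflow 1
    [0.8, 0.6, 0.1, 0.7],   # workflow 2
    [0.7, 0.3, 0.3, 0.8],   # workflow 3
])

ensemble_posterior = posteriors.mean(axis=0)            # average the predictions
ensemble_label = (ensemble_posterior >= 0.5).astype(int)
```

A sample on which individual workflows disagree (e.g., the second column: 0.4, 0.6, 0.3) gets a moderate ensemble probability, so no single overfitted workflow dominates the final decision.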

2.1.3 The WORC optimization algorithm

The optimization algorithm of our WORC framework is depicted in Algorithm 1. All optimization is performed on the training dataset by using a random-split cross-validation, using 80% for training and 20% for validation in a stratified manner, to make sure the distribution of the classes in all sets is similar to the original. A random-split cross-validation is used as this allows a fixed ratio between the training and validation datasets independent of the number of iterations, and is consistent with our evaluation setup (subsection 2.3). The algorithm returns an ensemble.

procedure WORC(training dataset)
    for each random-search iteration do
        randomly sample a workflow (algorithms and hyperparameters) from the search space
        evaluate the workflow on the internal cross-validation splits and record its mean validation F1-score
    end for
    rank the workflows by their mean validation F1-score
    Retrain the best performing workflows on the full training set
    Combine the retrained workflows into an ensemble
    return the ensemble
end procedure
Algorithm 1 The WORC optimization algorithm

2.2 Radiomics Components

In order to formulate radiomics as a CASH problem, the workflow needs to be modular and consist of standardized components. In this way, for each component, a set of algorithms and hyperparameters can be defined. We therefore split the radiomics workflow into the following components: image and segmentation preprocessing (2.2.1), feature extraction (2.2.2), feature and sample preprocessing (2.2.3), and machine learning (2.2.4). For each component, we have included a collection of commonly used algorithms. An overview of the default included components, algorithms, and associated hyperparameters in the WORC framework is provided in Table 1.

Step Component Algorithm Hyperparameter Distribution
1 Feature Selection Group-wise selection Activator
Activator per group
2 Feature Imputation Selector
Mean - -
Median - -
Mode - -
Constant (zero) - -
KNN Nr. Neighbors
3 Feature Selection Variance Threshold Activator
4 Feature Scaling Robust z-scoring - -
5 Feature Selection RELIEF Activator
Nr. Neighbors
Sample size
Distance P
Nr. Features
6 Feature Selection SelectFromModel Activator
LASSO alpha
RF Nr. Trees
7 Dimensionality Reduction PCA Activator
8 Feature Selection Univariate testing Activator
9 Resampling Activator
RandomUnderSampling Strategy
RandomOverSampling Strategy
NearMiss Strategy
NeighborhoodCleaningRule Strategy
Nr. Neighbors
Cleaning threshold
Nr. Neighbors
ADASYN Strategy
Nr. Neighbors
10 Classification Selector
SVM Kernel
Polynomial degree
RF Nr. Trees
Min. samples / split
Max. depth
LR Regularization
LDA Solver
QDA Regularization

Gaussian Naive Bayes - -
AdaBoost Nr. Estimators
Learning rate
XGBoost Nr. Rounds
Max. depth
Learning rate
Min. child weight
% Random samples
Table 1: Overview of the algorithms and associated hyperparameter search spaces in the random search as used in the WORC framework for binary classification problems. Definitions: Bernoulli distribution: equals a given value with a given probability; categorical distribution: a distribution over a set of categories; uniform distribution; discrete uniform distribution: uniform over discrete values only; log-uniform distribution: uniform on a logarithmic scale. Abbreviations: AdaBoost: adaptive boosting; ADASYN: adaptive synthetic sampling; KNN: k-nearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; PCA: principal component analysis; QDA: quadratic discriminant analysis; RBF: radial basis function; RF: random forest; SMOTE: synthetic minority oversampling technique; SVM: support vector machine; XGBoost: extreme gradient boosting.

2.2.1 Image and Segmentation Preprocessing

Before feature extraction, image preprocessing such as image quantization, normalization, resampling or noise filtering may be applied (Yip and Aerts, 2016; Parekh and Jacobs, 2016; van Griethuysen et al., 2017). By default no preprocessing is applied. The only exception is image normalization (using z-scoring), which we apply in modalities that do not have a fixed unit and scale (e.g. qualitative MRI, ultrasound), but not in modalities that have a fixed unit and scale (e.g. Computed Tomography (CT), quantitative MRI such as T1 mapping).

2.2.2 Feature Extraction

For each segmentation, 564 radiomics features quantifying intensity, shape, orientation and texture are extracted through the open-source feature toolboxes PyRadiomics (van Griethuysen et al., 2017) and PREDICT (van der Voort and Starmans, 2018). A comprehensive overview is provided in Table A.1. Thirteen intensity features describe various first-order statistics of the raw intensity distributions within the segmentation, such as the mean, standard deviation, and kurtosis. Thirty-five shape features describe the morphological properties of the segmentation, and are extracted based only on the segmentation, i.e., not using the image. These include shape descriptions such as the volume, compactness, and circular variance. Nine orientation features describe the orientation and positioning of the segmentation, i.e., not using the image. These include the major axis orientations of a 3D ellipse fitted to the segmentation, and the center of mass coordinates and indices. Lastly, 507 texture features are extracted, which include commonly used algorithms such as the Gray Level Co-occurrence Matrix (GLCM) (144 features) (Zwanenburg et al., 2020), Gray Level Size Zone Matrix (GLSZM) (16 features) (Zwanenburg et al., 2020), Gray Level Run Length Matrix (GLRLM) (16 features) (Zwanenburg et al., 2020), Gray Level Dependence Matrix (GLDM) (14 features) (Zwanenburg et al., 2020), Neighborhood Grey Tone Difference Matrix (NGTDM) (5 features) (Zwanenburg et al., 2020), Gabor filters (156 features) (Zwanenburg et al., 2020), Laplacian of Gaussian (LoG) filters (39 features) (Zwanenburg et al., 2020), and Local Binary Patterns (LBP) (39 features) (Ojala et al., 2002). Additionally, two less common feature groups are defined: based on local phase (Kovesi, 2003) (39 features) and vesselness filters (Frangi et al., 1998) (39 features).

Many radiomics studies include datasets with variations in the slice thickness due to heterogeneity in the acquisition protocols. This may cause feature values to be dependent on the acquisition protocol. Moreover, the slice thickness is often substantially larger than the pixel spacing. Hence, extracting robust 3D features may be hampered by these variations, especially for low resolutions. To overcome this issue, a 2.5D approach is used: all features except the histogram features are extracted per 2D axial slice and aggregated over all slices. Afterwards, several first-order statistics over the feature distributions are evaluated and used as actual features, see also Table A.1.
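The 2.5D approach can be sketched as follows: compute a feature per axial slice, then use first-order statistics of the per-slice distribution as the actual features. The `slice_feature` function below is a hypothetical stand-in for any 2D radiomics feature; the exact set of aggregation statistics used by WORC is listed in Table A.1, the four below are illustrative.

```python
import numpy as np

def slice_feature(slice_2d):
    """Stand-in for any 2D radiomics feature, e.g. a texture filter response."""
    return float(slice_2d.std())

def extract_25d(volume):
    """2.5D extraction: evaluate the feature per axial slice, then summarize
    the per-slice distribution with first-order statistics."""
    per_slice = np.array([slice_feature(s) for s in volume])
    return {
        "mean": per_slice.mean(),
        "std": per_slice.std(),
        "min": per_slice.min(),
        "max": per_slice.max(),
    }

volume = np.random.default_rng(1).normal(size=(20, 32, 32))  # 20 axial slices
features = extract_25d(volume)
```

Because each slice is processed in 2D, a 5 mm slice thickness versus a 1 mm pixel spacing no longer distorts the feature values, at the cost of ignoring through-plane texture.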

In addition to these features, depending on the application, clinical characteristics, e.g. age and sex, and manually scored features, e.g. based on visual inspection by a radiologist, can be added.

Some of the features have parameters themselves, such as the scale on which a derivative is taken. As some features are rather computationally expensive to extract, we do not include these parameters directly as hyperparameters in the CASH problem. Instead, the features are extracted for a predefined range of parameter values. In the next components, feature selection algorithms are employed to select the most relevant features and thus parameters. The used parameter ranges are reported in Table A.1.

Radiomics studies may involve multiple scans per sample, e.g. in multimodal (MRI + CT) or multi-contrast (T1-weighted MRI + T2-weighted MRI) studies. Commonly, radiomics features are defined on a single image, which also holds for the features described in this study. Hence, when multiple scans per sample are included, the 564 radiomics features are extracted per scan and concatenated.

2.2.3 Feature and Sample Preprocessing

We define feature and sample preprocessing as all algorithms that can be used between the feature extraction and machine learning components. The order of these algorithms in the WORC framework is fixed and given in Table 1.

Feature imputation is employed to replace missing feature values. Values may be missing when a feature could not be defined and computed, e.g. a lesion may be too small for a specific feature to be extracted. Algorithms for imputation include: 1) mean; 2) median; 3) mode; 4) constant value (default: zero); and 5) nearest neighbor approach.
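Most of these imputation strategies map directly onto scikit-learn's imputers, as sketched below on a toy feature matrix (mode corresponds to `strategy="most_frequent"`, omitted here for brevity; this is an illustration, not the WORC code):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Two features for four samples; the second feature is missing for two samples
X = np.array([[1.0, np.nan],
              [2.0, 10.0],
              [3.0, 14.0],
              [4.0, np.nan]])

mean_imp = SimpleImputer(strategy="mean").fit_transform(X)                     # 1) mean
median_imp = SimpleImputer(strategy="median").fit_transform(X)                 # 2) median
const_imp = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)  # 4) zero
knn_imp = KNNImputer(n_neighbors=2).fit_transform(X)                           # 5) nearest neighbors
```

The KNN imputer fills a missing value with the average over the nearest samples (by the observed features) that do have that feature, here (10 + 14) / 2 = 12.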

Feature scaling is employed to ensure that all features have a similar scale. As this generally benefits machine learning algorithms, this is always performed through z-scoring. A robust version is used, where outliers, defined as feature values outside a certain percentile range, are excluded before computation of the mean and standard deviation.
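Robust z-scoring can be sketched as follows. The exact percentile range is a configuration choice not specified here; the 5th–95th range in this sketch is an assumed example.

```python
import numpy as np

def robust_zscore(values, lower=5, upper=95):
    """Z-score using the mean and standard deviation of inliers only.
    The 5th-95th percentile range is an assumed example, not a fixed choice."""
    lo, hi = np.percentile(values, [lower, upper])
    inliers = values[(values >= lo) & (values <= hi)]
    return (values - inliers.mean()) / inliers.std()

# 100 well-behaved feature values plus one extreme outlier
x = np.concatenate([np.random.default_rng(2).normal(0, 1, 100), [1000.0]])
z = robust_zscore(x)
```

Because the outlier is excluded from the mean/std computation, the bulk of the values keeps a sensible scale while the outlier remains clearly flagged by its large z-score.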

Feature selection or dimensionality reduction algorithms may be employed to select the most relevant features and eliminate irrelevant or redundant features. As multiple algorithms may be combined, instead of defining feature selection or dimensionality reduction as a single step, each algorithm is included as a single step in the workflow with an activator hyperparameter to determine whether the algorithm is used or not.

Algorithms included are:

  1. A group-wise feature selection, in which groups of features (i.e., intensity, shape, and texture feature subgroups) can be selected or eliminated. To this end, each feature group has an activator hyperparameter. This algorithm serves as regularization, as it randomly reduces the feature set, and is therefore always used. The group-wise feature selection is the first step in the workflows, as it reduces the computation time of the other steps by reducing the feature space.

  2. A variance threshold, in which features with a low variance are removed. This algorithm is always used, as it serves as a feature sanity check with almost zero risk of removing relevant features. The variance threshold is applied before the feature scaling, as the scaling results in all features having unit variance.

  3. Optionally, the RELIEF algorithm (Urbanowicz et al., 2018), which ranks the features according to the differences between neighboring samples. Features with more differences between neighbors of different classes are considered higher in rank.

  4. Optionally, feature selection using a machine learning model (Fonti and Belitser, 2017). Features are selected based on their importance as given by a machine learning model trained on the dataset. Hence, the used algorithm should be able to give the features an importance weight. Algorithms included are LASSO, logistic regression, and random forest.

  5. Optionally, principal component analysis (PCA), in which either only those linear combinations of features are kept which explain 95% of the variance in the features, or a fixed number of components (10, 50, or 100) is selected.

  6. Optionally, individual feature selection through univariate testing. To this end, for each feature, a Mann-Whitney U test is performed to test for significant differences in distribution between the classes. Afterwards, only features with p-values below a certain threshold are selected. The (non-parametric) Mann-Whitney U test was chosen as it makes no assumptions about the distribution of the features.

RELIEF, selection using a model, PCA, and univariate testing have a 27.5% chance to be included in a workflow in the random search, as this gives an equal chance of applying any of these or no feature selection algorithm. The feature selection algorithms may only be combined in the mentioned order in the WORC framework.
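The univariate Mann–Whitney U selection (algorithm 6 above) can be sketched with SciPy on a synthetic dataset; the 0.05 threshold below is an illustrative value, as the text only specifies "a certain threshold":

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
n = 40
y = np.array([0] * 20 + [1] * 20)

informative = rng.normal(0, 1, n) + 3 * y   # distribution differs between classes
noise = rng.normal(0, 1, n)                 # identical distribution in both classes
X = np.column_stack([informative, noise])

threshold = 0.05  # illustrative p-value threshold
selected = [
    j for j in range(X.shape[1])
    if mannwhitneyu(X[y == 0, j], X[y == 1, j],
                    alternative="two-sided").pvalue < threshold
]
```

The informative feature, whose distribution is shifted between the classes, is retained; the test is non-parametric, so no normality of the feature values is assumed.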

Resampling algorithms may be used, primarily to deal with class imbalances. These include various algorithms from the imbalanced-learn toolbox (Lemaitre et al., 2017): 1) random under-sampling; 2) random over-sampling; 3) near-miss resampling; 4) the neighborhood cleaning rule; 5) SMOTE (Han et al., 2005) (regular, borderline, Tomek, and the edited nearest neighbors variant); and 6) ADASYN (He et al., 2008). All algorithms can apply four out of five different resampling strategies, resampling: 1) the minority class (not for undersampling algorithms); 2) all but the minority class; 3) the majority class (not for oversampling algorithms); 4) all but the majority class; and 5) all classes.
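As a minimal illustration of the simplest of these, random over-sampling of the minority class, the sketch below reimplements the idea in plain NumPy rather than calling imbalanced-learn:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Randomly duplicate minority-class samples until the classes are balanced
    (the 'resample the minority class' strategy)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority_count = counts.max()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=majority_count - len(idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # 7 vs 3 class imbalance
X_res, y_res = random_oversample(X, y)
```

Crucially, in the WORC framework such resampling is applied only within the training part of each split, so duplicated samples never leak into validation or test sets.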

2.2.4 Machine Learning

For machine learning, we mostly use methods from the scikit-learn toolbox (Pedregosa et al., 2011). The following classification algorithms are included: 1) logistic regression; 2) support vector machines (with a linear, polynomial, or radial basis function kernel); 3) random forests; 4) naive Bayes; 5) linear discriminant analysis; 6) quadratic discriminant analysis (QDA); 7) AdaBoost (Freund and Schapire, 1997); and 8) extreme gradient boosting (XGBoost) (Chen et al., 2015). The associated hyperparameters for each algorithm are depicted in Table 1.

2.3 Statistics

Evaluation using a single dataset is performed through a random-split cross-validation, see Figure A.1(a) for a schematic overview. A random-split cross-validation was chosen, as it has a relatively low computational complexity while facilitating estimation of the generalization error (Picard and Cook, 1984; Nadeau and Bengio, 2003). In each iteration, the data is randomly split in 80% for training and 20% for testing in a stratified manner. In each random-split iteration, all CASH optimization is performed within the training set according to Algorithm 1 to eliminate any risk of overfitting on the test set. When a fixed, independent training and test set are used, only the second, internal random-split cross-validation on the training set for the CASH optimization is used, see Figure A.1(b).
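This stratified random-split scheme corresponds to scikit-learn's `StratifiedShuffleSplit`; the sketch below uses 10 splits for illustration (the number of iterations used in the study is not restated here):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.default_rng(4).normal(size=(50, 5))
y = np.array([0] * 30 + [1] * 20)   # 60/40 class balance

# Random-split (Monte Carlo) cross-validation: stratified 80/20 splits
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    # stratification preserves the class proportions in both parts
    assert np.isclose((y[train_idx] == 1).mean(), 0.4)
    assert np.isclose((y[test_idx] == 1).mean(), 0.4)
```

Unlike k-fold cross-validation, the train/test ratio here stays fixed at 80/20 regardless of how many iterations are run.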

Performance metrics used for evaluation of the test set include the Area Under the Curve (AUC), calculated using the Receiver Operating Characteristic (ROC) curve, sensitivity, specificity, precision, recall, accuracy, and the Balanced Classification Rate (BCR) (Tharwat, 2021). When a single dataset, and thus a random-split cross-validation, is used, 95% confidence intervals of the performance metrics are constructed using the corrected resampled t-test, thereby taking into account that the samples in the cross-validation splits are not statistically independent (Nadeau and Bengio, 2003). When a fixed training and test set are used, 95% confidence intervals are constructed using 1000x bootstrap resampling of the test dataset and the standard method for normal distributions (Efron and Tibshirani (1986), table 6, method 1). ROC confidence bands are constructed using fixed-width bands (Macskassy et al., 2005).
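The bootstrap confidence interval for a fixed test set can be sketched as follows, here for the AUC on synthetic scores; the normal-approximation interval (mean ± 1.96 standard errors) follows the standard method referenced above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, 200)                    # synthetic test labels
y_score = y_true * 0.5 + rng.normal(0, 0.5, 200)    # informative but noisy scores

# 1000x bootstrap resampling of the test set
aucs = []
for _ in range(1000):
    idx = rng.choice(len(y_true), size=len(y_true), replace=True)
    if len(np.unique(y_true[idx])) < 2:             # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

# 95% CI via the normal approximation over the bootstrap distribution
mean, se = np.mean(aucs), np.std(aucs, ddof=1)
ci = (mean - 1.96 * se, mean + 1.96 * se)
```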

2.4 Software Implementation

The WORC toolbox is implemented in Python3 and available open-source (Starmans et al., 2018a) under the Apache License, Version 2.0. The WORC toolbox supports Unix and Windows operating systems. Documentation on the WORC toolbox can be found online (Starmans et al., 2018b), and several tutorials are available. Basic usage only requires the user to specify the locations of the used data (i.e., images, segmentations, ground truth). A minimal working example of the WORC toolbox interface in Python3 is shown in Algorithm A.1.

The WORC toolbox makes use of the fastr package (Achterberg et al., 2016), an automated workflow engine. fastr does not provide any actual implementation of the required (radiomics) algorithms, but serves as a computational workflow engine, which has several advantages. Firstly, fastr requires workflows to be modular and split into standardized components or tools, with standardized inputs and outputs. This nicely connects to the modular design of WORC, for which we therefore wrapped each component as a tool in fastr. Alternating between feature extraction toolboxes can be easily done by changing a single field in the WORC toolbox configuration. Second, provenance is automatically tracked by fastr to facilitate repeatability and reproducibility. Third, fastr offers support for multiple execution plugins in order to be able to execute the same workflow on different computational resources or clusters. Examples include linear execution, local threading on multiple CPUs, and SLURM (Yoo et al., 2003). Fourth, fastr is agnostic to software language. Hence, instead of restricting the user to a single programming language, algorithms (e.g. feature toolboxes) can be supplied in a variety of languages such as Python, Matlab, R and command line executables. Fifth, fastr provides a variety of import and export plugins for loading and saving data. Besides using the local file storage, these include use of XNAT (Marcus et al., 2007).

The computation time of a WORC experiment roughly scales with the number of cross-validation iterations, the number of random search iterations, and the number of samples. A high degree of parallelization over all these parameters is possible, as all workflows can be executed independently of each other. We chose to run the iterations of the random-split cross-validation sequentially instead of in parallel to maintain a sustainable computational load; within each iteration, all workflows are run in parallel. The default experiments in this study consist of executing 500,000 workflows. On average, experiments in our study had a computation time of approximately 18 hours on a machine with 24 Intel E5-2695 v2 CPU cores. The contribution of the feature extraction to the computation time is negligible.
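Since the candidate workflows are mutually independent, their execution parallelizes trivially; a minimal standard-library sketch of the idea (WORC itself delegates this to fastr's execution plugins, not to the code below):

```python
from concurrent.futures import ThreadPoolExecutor

def run_workflow(config):
    """Stand-in for training and validating one candidate workflow;
    here the 'score' is just a toy function of the configuration."""
    return {"config": config, "score": 1.0 / (1.0 + config["C"])}

configs = [{"C": c} for c in (0.01, 0.1, 1.0, 10.0)]

# Independent workflows: any execution backend (threads, processes,
# a SLURM cluster) can evaluate them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_workflow, configs))

best = max(results, key=lambda r: r["score"])
print(best["config"])  # → {'C': 0.01}
```

Swapping `ThreadPoolExecutor` for a cluster scheduler changes only the execution backend, which is exactly the flexibility the fastr execution plugins provide.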

3 Experiments

3.1 Evaluation of default configuration on twelve different clinical applications

In order to evaluate our WORC framework, experiments were performed on twelve different clinical applications: see Table 2 for an overview of the twelve datasets, and Figure 2 for example images from each dataset. All datasets are multi-center with heterogeneity in the image acquisition protocols. For each experiment, per patient, one or more scan(s) and segmentation(s), and a ground truth label are provided. All scans were made at “baseline”, i.e., before any form of treatment or surgery. One dataset (the Glioma dataset) consists of a fixed, independent training and test set and is thus evaluated using 1000x bootstrap resampling. In the other eleven datasets, the performance is evaluated using the random-split cross-validation.

# Dataset Patients Modality Segmentation Description
1. Lipo 115 T1w MRI Tumor Distinguishing well-differentiated liposarcoma from lipoma in 116 lesions from 115 patients (Vos et al., 2019).
2. Desmoid 203 T1w MRI Tumor Differentiating desmoid-type fibromatosis from soft-tissue sarcoma (Timbergen et al., 2020).
3. Liver 186 T2w MRI Tumor Distinguishing malignant from benign primary solid liver lesions (Starmans et al., 2021b).
4. GIST 246 CT Tumor Differentiating gastrointestinal stromal tumors (GIST) from other intra-abdominal tumors in 247 lesions from 246 patients (Starmans et al., 2020b).
5. CRLM 77 CT Tumor Distinguishing replacement from desmoplastic histopathological growth patterns in colorectal liver metastases (CRLM) in 93 lesions from 77 patients (Starmans et al., 2021a).
6. Melanoma 103 CT Tumor Predicting the BRAF mutation status in melanoma lung metastases in 169 lesions from 103 patients (Angus et al., 2021).
7. HCC 154 T2w MRI Liver Distinguishing livers in which no hepatocellular carcinoma (HCC) developed from livers with HCC at first detection during screening (Starmans et al., 2020a).
8. MesFib 68 CT Surrounding mesentery Identifying patients with mesenteric fibrosis at risk of developing intestinal complications (Blazevic et al., 2021).
9. Prostate 40 T2w MRI, DWI, ADC Lesion Classifying suspected prostate cancer lesions as high-grade versus low-grade based on the Gleason score in 72 lesions from 40 patients (Castillo T et al., 2019).
10. Glioma 413 T1w & T2w MRI Tumor Predicting the 1p/19q co-deletion in patients with presumed low-grade glioma with a training set of 284 patients and an external validation set of 129 patients (van der Voort et al., 2019b).
11. Alzheimer 848 T1w MRI Hippocampus Distinguishing patients with Alzheimer’s disease from cognitively normal individuals in 848 subjects based on baseline T1w MRIs (Bron et al., 2021).
12. H&N 137 CT Gross tumor volume Predicting the T-stage (high or low) in patients with head-and-neck cancer (Aerts et al., 2014).
Dataset publicly released as part of this study (Starmans et al., 2021c).
Table 2: Overview of the twelve datasets used in this study to evaluate our WORC framework. Abbreviations: ADC: Apparent Diffusion Coefficient; CT: Computed Tomography; DWI: Diffusion Weighted Imaging; MRI: Magnetic Resonance Imaging; T1w: T1 weighted; T2w: T2 weighted.
Figure 2: Examples of the 2D slices from the 3D imaging data from the twelve datasets used in this study to evaluate our WORC framework. For each dataset, for one patient of each of the two classes, the 2D slice in the primary scan direction (e.g., axial) with the largest area of the segmentation is depicted; the boundary of the segmentation is projected in color on the image. The datasets included were from different clinical applications: a. lipomatous tumors (Vos et al., 2019); b. desmoid-type fibromatosis (Timbergen et al., 2020); c. primary solid liver tumors (Starmans et al., 2021b); d. gastrointestinal stromal tumors (Starmans et al., 2020b); e. colorectal liver metastases (Starmans et al., 2021a); f. melanoma (Angus et al., 2021); g. hepatocellular carcinoma (Starmans et al., 2020a); h. mesenteric fibrosis (Blazevic et al., 2021); i. prostate cancer (Castillo T et al., 2019); j. low grade glioma (van der Voort et al., 2019b); k. Alzheimer’s disease (Bron et al., 2021); and l. head and neck cancer (Aerts et al., 2014).

The first six datasets (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma) are publicly released as part of this study, see (Starmans et al., 2021c) for more details. Three datasets (HCC, MesFib, and Prostate) cannot be made publicly available. The final three datasets (Glioma, Alzheimer, and H&N) are already publicly available, and were described in previous studies (van der Voort et al., 2019b; Jack et al., 2015; Aerts et al., 2014).

For the Glioma dataset, the raw imaging data was not available. Instead, pre-computed radiomics features are available (van der Voort et al., 2019a), which were directly fed into WORC.

The Alzheimer dataset was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI was to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). This dataset will be referred to as the “Alzheimer” dataset. Here, radiomics was used to distinguish patients with AD from cognitively normal (CN) individuals. The cohort as described by Bron et al. (2021) was used, which includes 334 patients with AD and 520 CN individuals with approximately the same mean age in both groups (AD: 74.9 years; CN: 74.2 years). The hippocampus was chosen as the region of interest for the radiomics analysis, as this region is known to suffer from atrophy early in the disease process of AD. Automatic hippocampus segmentations were obtained for each patient using the algorithm described by Bron et al. (2014).

The H&N dataset (Aerts et al., 2014) was obtained from a public database and directly fed into WORC. For each lesion, the first gross tumor volume (GTV-1) segmentation was used as the region of interest for the radiomics analysis. Patients without a CT scan or a GTV-1 segmentation were excluded.

3.2 Influence of the number of random search iterations and ensemble size

An additional experiment was conducted to investigate the influence of the number of random search iterations and the ensemble size on the performance. For reproducibility, this experiment was performed using the six datasets publicly released in this study (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma). We hypothesize that increasing the ensemble size will at first improve the performance and stability, but that after some point, when the ratio of the ensemble size to the number of random search iterations becomes too high, it will reduce the performance and stability, as bad solutions are added to the ensemble.

We varied the number of random search iterations and the ensemble size, and repeated each experiment ten times with different seeds for the random number generator. To limit the computational burden, a reduced number of cross-validation iterations was used instead of the default, and the experiment was only performed once instead of ten times. For each configuration, both the average performance and the stability were assessed in terms of the mean and standard deviation of the weighted F1-score. Based on these experiments, the default number of random search iterations and ensemble size for the WORC optimization algorithm were determined and used in all other experiments.
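The experiment can be mimicked on toy data, with random forests standing in for the full WORC search space; this is an illustrative sketch under those assumptions, not the actual experimental setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def ensemble_f1(n_search, ensemble_size, seed):
    """Random search over workflow hyperparameters, then average the
    predictions of the best candidates (selected on a validation split)."""
    X, y = make_classification(n_samples=200, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=seed)
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_search):  # the random search
        clf = RandomForestClassifier(
            n_estimators=int(rng.integers(5, 50)),
            max_depth=int(rng.integers(1, 8)),
            random_state=0,
        ).fit(X_fit, y_fit)
        candidates.append((f1_score(y_val, clf.predict(X_val)), clf))
    # Rank by validation score and ensemble the best candidates by
    # averaging their predicted probabilities.
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = [clf for _, clf in candidates[:ensemble_size]]
    proba = np.mean([clf.predict_proba(X_te)[:, 1] for clf in top], axis=0)
    return f1_score(y_te, (proba > 0.5).astype(int))

# Repeat with different seeds to assess mean performance and stability.
scores = [ensemble_f1(n_search=20, ensemble_size=5, seed=s) for s in range(3)]
print(f"mean={np.mean(scores):.2f}, std={np.std(scores):.3f}")
```

Sweeping `n_search` and `ensemble_size` over a grid and recording the mean and standard deviation per cell reproduces the structure of Table 4.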

4 Results

4.1 Application of the Worc framework to twelve datasets

Error plots of the AUCs from the application of our WORC framework with the same default configuration on the twelve different datasets are shown in Figure 3; detailed performances, including other metrics, are shown in Table 3; the ROC curves are shown in Figure A.2. In eleven of the twelve datasets, we successfully found a prediction model, with mean AUCs of 0.83 (Lipo), 0.82 (Desmoid), 0.81 (Liver), 0.77 (GIST), 0.68 (CRLM), 0.75 (HCC), 0.81 (MesFib), 0.72 (Prostate), 0.70 (Glioma), 0.87 (Alzheimer), and 0.84 (H&N). In the Melanoma dataset, the mean AUC (0.51) was similar to that of guessing (0.50).

Figure 3: Error plots of the area under the receiver operating characteristic curve (AUC) of the radiomics models on the twelve datasets. The error plots represent the 95% confidence intervals, estimated through random-split cross-validation on the entire dataset (all except Glioma) or through 1000x bootstrap resampling of the independent test set (Glioma). The circle represents the mean (all except Glioma) or point estimate (Glioma), which is also stated to the right of each circle. The dashed line corresponds to the AUC of random guessing (0.50).
Dataset Lipo Desmoid Liver GIST CRLM Melanoma
AUC 0.83 [0.75, 0.91] 0.82 [0.75, 0.87] 0.81 [0.74, 0.87] 0.77 [0.71, 0.83] 0.68 [0.56, 0.80] 0.51 [0.41, 0.62]
BCR 0.74 [0.66, 0.82] 0.73 [0.66, 0.79] 0.72 [0.65, 0.80] 0.70 [0.65, 0.76] 0.62 [0.51, 0.72] 0.51 [0.42, 0.60]
Weighted F1 0.73 [0.65, 0.82] 0.76 [0.69, 0.82] 0.72 [0.65, 0.79] 0.70 [0.65, 0.75] 0.61 [0.51, 0.72] 0.50 [0.41, 0.59]
Sensitivity 0.72 [0.60, 0.84] 0.59 [0.46, 0.72] 0.74 [0.64, 0.84] 0.67 [0.58, 0.76] 0.60 [0.44, 0.75] 0.53 [0.38, 0.67]
Specificity 0.75 [0.63, 0.88] 0.87 [0.80, 0.93] 0.71 [0.60, 0.83] 0.73 [0.65, 0.81] 0.64 [0.47, 0.80] 0.49 [0.36, 0.62]
Dataset HCC MesFib Prostate Glioma Alzheimer H&N
AUC 0.75 [0.67, 0.83] 0.81 [0.72, 0.91] 0.72 [0.61, 0.82] 0.70 [0.62, 0.81] 0.87 [0.85, 0.90] 0.84 [0.77, 0.91]
BCR 0.69 [0.61, 0.78] 0.72 [0.62, 0.82] 0.67 [0.57, 0.78] 0.62 [0.55, 0.69] 0.78 [0.75, 0.80] 0.75 [0.67, 0.82]
Weighted F1 0.69 [0.60, 0.78] 0.72 [0.61, 0.83] 0.67 [0.56, 0.78] 0.53 [0.43, 0.63] 0.79 [0.77, 0.82] 0.75 [0.67, 0.83]
Sensitivity 0.72 [0.59, 0.85] 0.78 [0.61, 0.94] 0.67 [0.49, 0.85] 0.36 [0.25, 0.47] 0.69 [0.64, 0.75] 0.79 [0.68, 0.90]
Specificity 0.67 [0.54, 0.80] 0.67 [0.49, 0.85] 0.68 [0.53, 0.82] 0.89 [0.79, 0.98] 0.86 [0.83, 0.89] 0.71 [0.58, 0.83]
Table 3: Classification results of our WORC framework on the twelve datasets. For all metrics, the mean and 95% confidence intervals (CIs) are reported. For all datasets except Glioma, the 95% CIs were constructed through the random-split cross-validation; for Glioma, through 1000x bootstrap resampling of the test set. Abbreviations: AUC: area under the receiver operating characteristic curve; BCR: balanced classification rate (Tharwat, 2021); Weighted F1: weighted F1-score.
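On toy predictions, the reported metrics can be computed with scikit-learn as follows (BCR taken here as the mean of sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Toy ground truth, continuous scores, and thresholded predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.8, 0.1])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)           # recall of the positive class
specificity = tn / (tn + fp)
bcr = (sensitivity + specificity) / 2  # balanced classification rate
auc = roc_auc_score(y_true, y_score)   # threshold-free, from the scores
f1w = f1_score(y_true, y_pred, average="weighted")

print(f"AUC={auc:.2f} BCR={bcr:.2f} F1w={f1w:.2f} "
      f"sens={sensitivity:.2f} spec={specificity:.2f}")
```

Note that the AUC is computed from the continuous scores, whereas BCR, the weighted F1-score, sensitivity, and specificity depend on the chosen decision threshold.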

4.2 Influence of the number of random search iterations and ensemble size

The performance of varying the number of random search iterations and ensemble size in the first six datasets is reported in Table 4.

For five of the six datasets in this experiment (Lipo, Desmoid, Liver, GIST, and CRLM), the mean performance generally improved when increasing both the number of random search iterations and the ensemble size. The sixth dataset (Melanoma) is an exception, as the performances for varying numbers of random search iterations and ensemble sizes were similar. This can be attributed to the fact that it is the only dataset in this study on which we could not successfully construct a predictive model.

In the first five datasets, the mean weighted F1-score for the lowest values, i.e., a single random search iteration (only trying one random workflow) and an ensemble size of one (no ensembling), was 0.75 (Lipo), 0.61 (Desmoid), 0.66 (Liver), 0.67 (GIST), and 0.54 (CRLM). The mean performance for the highest values of both parameters was substantially higher for all five datasets (Lipo: 0.84; Desmoid: 0.72; Liver: 0.80; GIST: 0.76; and CRLM: 0.63). The mean score for the second-highest number of random search iterations was very similar to that for the highest, while the latter took 25 times longer to execute. This indicates that at some point, increasing the computation time by trying out more workflows no longer results in an increase in performance on the test set.

At the lowest number of random search iterations and ensemble size, the standard deviation of the weighted F1-score (Lipo: 0.026; Desmoid: 0.023; Liver: 0.022; GIST: 0.038; and CRLM: 0.027) was substantially higher than at the highest values (Lipo: 0.001; Desmoid: 0.004; Liver: 0.002; GIST: 0.002; and CRLM: 0.005). This indicates that increasing the number of random search iterations and the ensemble size improves the stability of the model. The standard deviations at the second-highest number of random search iterations were similar to those at the highest, illustrating that, similar to the mean performance, the stability at some point converges. For each number of random search iterations, the standard deviation at first decreased when increasing the ensemble size, but increased when the ensemble size became similar or equal to the number of random search iterations.

Lipo:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.754 0.026 0.772 0.021 0.790 0.026 0.784 0.016 0.790 0.012 0.800
0.771 0.007 0.819 0.015 0.833 0.005 0.841 0.004 0.830 0.004 0.830
- - 0.801 0.008 0.815 0.004 0.855 0.002 0.843 0.002 0.836
- - - - 0.808 0.006 0.853 0.002 0.848 0.001 0.842

Desmoid:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.607 0.023 0.621 0.020 0.612 0.024 0.634 0.018 0.679 0.012 0.689
0.660 0.020 0.701 0.012 0.690 0.013 0.697 0.015 0.706 0.010 0.719
- - 0.696 0.012 0.709 0.008 0.712 0.008 0.715 0.004 0.717
- - - - 0.699 0.005 0.717 0.005 0.719 0.004 0.717

Liver:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.661 0.022 0.703 0.027 0.713 0.019 0.743 0.023 0.773 0.008 0.778
0.709 0.016 0.755 0.015 0.762 0.011 0.792 0.007 0.805 0.005 0.806
- - 0.753 0.013 0.767 0.005 0.797 0.004 0.801 0.003 0.807
- - - - 0.766 0.006 0.793 0.003 0.798 0.002 0.803

GIST:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.668 0.038 0.709 0.023 0.712 0.023 0.725 0.019 0.735 0.010 0.733
0.683 0.018 0.744 0.017 0.749 0.009 0.758 0.008 0.756 0.003 0.763
- - 0.717 0.019 0.738 0.008 0.764 0.002 0.762 0.002 0.762
- - - - 0.725 0.009 0.766 0.002 0.761 0.002 0.761

CRLM:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.545 0.027 0.572 0.038 0.555 0.025 0.589 0.026 0.583 0.015 0.591
0.586 0.025 0.611 0.014 0.619 0.011 0.621 0.010 0.625 0.008 0.615
- - 0.620 0.014 0.635 0.013 0.633 0.008 0.626 0.005 0.635
- - - - 0.633 0.013 0.639 0.007 0.621 0.005 0.633

Melanoma:
Mean Std Mean Std Mean Std Mean Std Mean Std Mean
0.500 0.018 0.509 0.020 0.506 0.015 0.528 0.021 0.546 0.020 0.552
0.489 0.018 0.506 0.016 0.508 0.011 0.522 0.015 0.539 0.011 0.553
- - 0.488 0.011 0.495 0.008 0.520 0.009 0.534 0.005 0.537
- - - - 0.490 0.010 0.513 0.004 0.529 0.004 0.536

Table 4: Mean and standard deviation (Std) of the weighted F1-score when repeating each experiment ten times with a varying number of random search iterations and ensemble size on six different datasets (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma). Within each sub-table, rows and columns are ordered by increasing values of these two parameters; a dash indicates a combination that was not evaluated. The color coding of the mean indicates the relative performance on each dataset (green: high; red: low); the color coding of the standard deviation indicates the relative variation on each dataset (dark: high; light: low).

5 Discussion

In this study, we proposed a framework to automatically construct and optimize radiomics workflows to generalize radiomics across applications. To evaluate the performance and generalization, we applied our framework to twelve different, independent clinical applications, while using the exact same configuration. We were able to find a classification model in eleven applications, indicating that our WORC framework can be used to automatically find radiomics signatures in various clinical applications.

The increase in radiomics studies in recent years has led to a wide variety of radiomics algorithms and related software implementations (Rizzo et al., 2018; Song et al., 2020). For a new clinical application, finding a suitable radiomics workflow has to be done manually, which is a tedious and time-consuming process lacking reproducibility. We exploited advances in automated machine learning to fully automatically construct complete radiomics workflows from a large search space of radiomics components, including image preprocessing, feature calculation, feature and sample preprocessing, and machine learning algorithms. Hence, our WORC framework streamlines the construction and optimization of radiomics workflows in new applications, and thus facilitates probing datasets for radiomics signatures.

In the field of radiomics, there is a lack of reproducibility, while this is vital for the transition of radiomics models to clinical practice (Song et al., 2020; Traverso et al., 2018). A recent study (dos Santos et al., 2021) even warned that radiomics research must achieve “higher evidence levels” to avoid a reproducibility crisis such as the recent one in psychology (Open Science Collaboration, 2015). Hence, to facilitate reproducibility, besides automating the radiomics workflow construction, we have publicly released six datasets with a total of 930 patients (Starmans et al., 2021c), and made the WORC toolbox and the code to perform our experiments on all datasets open-source (Starmans et al., 2018a; Starmans, 2021). Besides a lack of reproducibility, there is a positive publication bias in radiomics, with as few as 6% of the studies between 2015 and 2018 showing negative results, as reported by Buvat and Orlhac (2019). They indicate that, to overcome this bias, sound methodology, robustness, reproducibility, and standardization are key. By addressing these factors in our study, including extensive validation of our framework on twelve different clinical applications, we hope to contribute to overcoming the challenges for publishing negative results.

Of the twelve datasets included in this study, the Melanoma dataset is the only one for which we were not able to find a biomarker; this result is studied in detail in Angus et al. (2021), who additionally show that scoring by a radiologist also led to a negative result. This supports the validity of our framework, showing that it does not invent a relation when one does not exist.

Several initiatives towards standardization of radiomics have been formed. The Radiomics Quality Score (RQS) was defined to assess the quality of radiomics studies (Lambin et al., 2017). While the RQS provides guidelines for the overall study evaluation and reporting, it does not provide standardization of the radiomics workflows or algorithms themselves. The Imaging Biomarker Standardization Initiative (IBSI) (Hatt et al., 2018; Zwanenburg et al., 2020) provides guidelines for the radiomics feature extraction component and standardization for a set of 174 features (we use 564 features by default, of which a part is included in IBSI). In this study, we complement these important initiatives by addressing the standardization of the radiomics workflow itself.

Related to this work, AutoML has previously been applied in radiomics using the Tree-based Pipeline Optimization Tool (TPOT) (Olson and Moore, 2019), by Su et al. (2019) to predict the H3 K27M mutation in midline glioma and by Sun et al. (2019) to predict invasive placentation. These studies are examples of using AutoML to optimize the machine learning component of radiomics in two specific applications. In this study, we streamlined the construction and optimization of the complete radiomics workflow, included a large collection of commonly used radiomics algorithms in the search space, and extensively validated our approach and evaluated its generalizability in twelve different applications. Additionally, our work shows similarities with the Medical Segmentation Decathlon (MSD) (Antonelli et al., 2021), in which algorithms were compared on a multitude of segmentation tasks. To this end, the MSD provided data representative of various challenges in medical imaging and created a framework for benchmarking segmentation algorithms and evaluating their generalizability. Although not in a challenge design, our contributions are similar, but on a different task, as we focus on radiomics, i.e., classification of clinical outcomes, instead of segmentation. Moreover, besides comparing a large collection of algorithms, we optimized combining them in a radiomics prediction model using AutoML and ensembling.

The field of medical deep learning faces several challenges similar to those of conventional radiomics (Afshar et al., 2019; Parekh and Jacobs, 2019; Avanzo et al., 2020; Bodalal et al., 2019): a lack of standardization, a wide variety of available algorithms, and the need to tune model selection and hyperparameters per application. The same problem thus persists: for a given application, from all available deep learning algorithms, how does one find the optimal (combination of) workflows? Here, we showed that automated machine learning may be used to streamline this process for conventional radiomics algorithms. Hence, future research may include a similar framework to WORC to facilitate the construction and optimization of deep learning workflows, including the full workflow from image to prediction, or a hybrid approach combining deep learning and conventional radiomics. In the field of computer science, automatic deep learning model selection is addressed by Neural Architecture Search (NAS) (Elsken et al., 2018), currently a hot topic in the field of AutoML (Escalante et al., 2021). In deep learning for medical imaging, NAS is still at an early stage, and the available algorithms mostly focus on segmentation (Mao et al., 2021). While the main concept of our framework, i.e., the CASH optimization, could be applied in a similar fashion to deep learning, this poses several challenges. First, deep learning models generally take much longer to train, on the order of hours or even days, compared to less than a second for conventional machine learning methods; our extensive optimization and cross-validation setup is therefore not feasible. Second, the deep learning search space is less clear due to the wide variety of design options, while conventional radiomics workflows typically follow the same steps. Lastly, while current NAS approaches mostly focus on architectural design hyperparameters, pre- and post-processing choices may be equally important to include in the search space (Isensee et al., 2021). Most NAS methods jointly optimize the network hyperparameters and weights through gradient-based optimization. As the pre- and post-processing are performed outside of the network and require selector-type hyperparameters, combined optimization with the architectural design options is not trivial.

The two main components of the WORC optimization algorithm are the random search and the ensemble. Our results show that, in line with our hypothesis, increasing the number of random search iterations and the ensemble size at first improves both the performance and the stability of the resulting models. However, as we also hypothesized, when the ratio of the ensemble size to the number of random search iterations becomes too large, the performance and stability decrease. On the six datasets in this experiment, the performance and stability at the second-highest number of random search iterations were similar to those at the highest, while the computation time does increase substantially. The default number of random search iterations in the WORC optimization algorithm was therefore set to the second-highest evaluated value, together with an ensemble size chosen to keep the ratio between the two favorable.

For the three previously publicly released datasets from other studies, we compared the performance of our WORC framework to that of the original studies. In the Glioma dataset, our performance (AUC of 0.70) was similar to that of the original study (van der Voort et al. (2019b): AUC of 0.72). We thus showed that our framework was able to successfully construct a signature using an external set of features. Moreover, as the Glioma dataset consists of a separate training and external validation set, we also verified the external-validation setup (Figure A.1(b)). In the Alzheimer dataset, our performance (AUC of 0.87) was also similar to that of the original study (Bron et al. (2021): AUC range of 0.80 - 0.94, depending on the level of preprocessing). However, Bron et al. (2021) used whole-brain voxel-wise features, while we used radiomics features extracted from the hippocampus only. We may therefore have missed information from other brain regions, which could have had a negative effect on the performance in our study. On the H&N dataset, Aerts et al. (2014) did not evaluate the prognostic value of radiomics for predicting the T-stage, but rather the association through the concordance index (0.69). Moreover, Aerts et al. (2014) trained a model on a separate dataset of patients with a different clinical application (lung tumors) and externally validated the signature on the H&N dataset, while we performed an internal cross-validation on the H&N dataset. As the lung dataset is not publicly available (anymore), the original experimental setup could not be replicated; hence, the results cannot be directly compared. Concluding, to the extent possible when comparing the results, our WORC framework showed a performance similar to that of the original studies.

In principle, our WORC framework can be used to construct and optimize the radiomics workflow in any radiomics application. However, there is a trade-off between the brute-force optimization of our WORC algorithm and using prior (domain) knowledge to develop a “logical” algorithm. Nonetheless, even in a small search space, deciding purely on the basis of prior knowledge which algorithm will be optimal is complex and generally not feasible. Therefore, we suggest using domain knowledge to reduce the search space, as in certain applications it may be possible to determine a priori which algorithms have a (near) zero chance of succeeding. The WORC optimization algorithm can then be used to construct and optimize the radiomics workflow within the remaining search space. Moreover, when the optimal solution is not expected to be included in the default WORC search space and thus a new radiomics method is proposed, this method can be added to our framework in a straightforward manner. This facilitates systematic comparison of the new method with the existing, already included methods, and combining the new method with (parts of) the existing methods to optimize the radiomics workflow and increase the overall performance.
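As a purely hypothetical illustration (the keys below are invented for this sketch, not actual WORC configuration fields), pruning the search space with domain knowledge might look like:

```python
from math import prod

# Hypothetical discrete search space over workflow components.
search_space = {
    "classifier": ["SVM", "RandomForest", "LogisticRegression", "AdaBoost"],
    "feature_scaling": ["z-score", "robust", "none"],
    "resampling": ["none", "SMOTE", "RandomOverSampling"],
}

# Example prior: the classes are balanced, so resampling is judged to
# have a (near) zero chance of helping and is pruned before the search.
pruned = dict(search_space, resampling=["none"])

def n_combinations(space):
    """Number of discrete workflow combinations in a search space."""
    return prod(len(options) for options in space.values())

print(n_combinations(search_space), "->", n_combinations(pruned))  # 36 -> 12
```

The random search then samples only from the pruned space, spending the same budget on a smaller, more plausible set of candidate workflows.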

Future research could include, firstly, the use of more advanced optimization strategies to improve the performance. Generally, a random search, as we use in the WORC optimization algorithm, serves as a solid baseline for optimization problems (Bergstra and Bengio, 2012). However, there is no guarantee that the optimum has been found, and the result may differ when repeating an experiment. The original study introducing CASH used Bayesian optimization, which may overcome these issues (Thornton et al., 2013). Other strategies include multi-fidelity optimization (e.g., bandits), genetic or evolutionary algorithms (Su et al., 2019), or gradient-based optimization (Hutter et al., 2019). However, the hyperparameter space in the WORC framework is relatively large due to the inclusion of multiple (optional) algorithm collections instead of just one, making optimization more complex and computationally expensive. Moreover, optimizing the performance further on the validation set may result in overfitting (Hutter et al., 2019), actually resulting in worse generalization. Secondly, as we evaluated our framework on twelve different datasets, when applying WORC to a new dataset, meta-learning could be used to learn from the results on these previous twelve datasets (Hutter et al., 2019). Especially on smaller datasets, taking into account which solutions worked best on previous datasets may improve the performance and lower the computation time. Thirdly, future research into more advanced ensembling strategies may also improve the performance and stability (Bonab and Can, 2019). Lastly, our framework may be used on other clinical applications to automatically optimize radiomics workflows. While we only showed the use of our framework on CT and MRI, the features used have also been shown to be successful in other modalities such as PET (Yang et al., 2020) and ultrasound (Yu et al., 2019), and thus the WORC framework could also be useful for these modalities.

6 Conclusions

In this study, we proposed a framework for the fully automatic construction and optimization of radiomics workflows to generalize radiomics across applications. The framework was validated on twelve different, independent clinical applications, on eleven of which our framework automatically constructed a successful radiomics model. On the three of these datasets that were previously publicly released and analyzed with different methods, we achieved a performance similar to that of the original studies. Hence, our framework may be used to streamline the construction and optimization of radiomics workflows on new applications, and thus to probe datasets for radiomics signatures. By publicly releasing six datasets, the WORC toolbox implementing our framework, and the code to reproduce the experiments of this study, we aim to facilitate reproducibility and validation of radiomics algorithms.

Data Statement

Six of the datasets used in this study (Lipo, Desmoid, Liver, GIST, CRLM, and Melanoma), comprising a total of 930 patients, are publicly released as part of this study and hosted via a public XNAT (Starmans et al., 2021c). By storing all data on XNAT in a structured and standardized manner, experiments using these datasets can easily be executed at various computational resources with the same code.

Three datasets were already publicly available as described in section 3. The other three datasets could not be made publicly available. The code for the experiments on the nine publicly available datasets is available on GitHub (Starmans, 2021).


Acknowledgments

The authors thank Laurens Groenendijk for his assistance in processing the data and in the anonymization procedures, and Hakim Achterberg for his assistance in the development of the software. Martijn P. A. Starmans and Jose M. Castillo T. acknowledge funding from the research program STRaTeGy with project numbers 14929, 14930, and 14932, which is (partly) financed by the Netherlands Organization for Scientific Research (NWO). Sebastian R. van der Voort acknowledges funding from the Dutch Cancer Society (KWF project number EMCR 2015-7859). Part of this study was financed by the Stichting Coolsingel (reference number 567), a Dutch non-profit foundation. This study is supported by EuCanImage (European Union’s Horizon 2020 research and innovation programme under grant agreement Nr. 952103). This work was partially carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

Data collection and sharing for this project was partially funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.


Funding Statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing Interests Statement

Wiro J. Niessen is founder, scientific lead, and shareholder of Quantib BV. Jacob J. Visser is a medical advisor at Contextflow. Astrid A. M. van der Veldt is a consultant (fees paid to the institute) at BMS, Merck, MSD, Sanofi, Eisai, Pfizer, Roche, Novartis, Pierre Fabre and Ipsen. The other authors do not declare any conflicts of interest.

CRediT Author Statement

M.P.A.S., W.J.N., and S.K. provided the conception and design of the study. M.P.A.S., M.J.M.T., M.V., G.A.P., W.K., D.H., D.J.G., C.V., S.S., R.S.D., C.J.E., F.F., G.J.L.H.v.L., A.B., J.H., T.B., R.v.G., G.J.H.F., R.A.F., W.W.d.H., F.E.B., F.E.J.A.W., B.G.K., L.A., A.A.M.v.d.V., A.R., A.E.O., J.M.C.T., J.V., I.S., M.R., Mic.D., R.d.M., J.IJ., R.L.M., P.B.V., E.E.B., M.G.T., and J.J.V. acquired the data. M.P.A.S., S.R.v.d.V., M.J.M.T., M.V., A.B., F.E.B., L.A., Mit.D., J.M.C.T., R.L.M., E.B., M.G.T. and S.K. analyzed and interpreted the data. M.P.A.S., S.R.v.d.V., T.P., and Mit.D. created the software. M.P.A.S. and S.K. drafted the article. All authors read and approved the final manuscript.

Ethics Statement

The protocol of this study conformed to the ethical guidelines of the 1975 Declaration of Helsinki. Approval by the local institutional review board of the Erasmus MC (Rotterdam, the Netherlands) was obtained for the collection of the WORC database (MEC-2020-0961), and separately for eight of the studies using in-house data (Lipo: MEC-2016-339, Desmoid: MEC-2016-339, Liver: MEC-2017-1035, GIST: MEC-2017-1187, CRLM: MEC-2017-479, Melanoma: MEC-2019-0693, HCC: MEC-2018-1621, Prostate: NL32105.078.10). The need for informed consent was waived due to the use of anonymized, retrospective data. For the ninth study involving in-house data (Mesfib), no ethical approval or informed consent was required, as the study was performed retrospectively with anonymized data.



References

  • Achterberg et al. (2016) Achterberg, H.C., Koek, M., Niessen, W.J., 2016. Fastr: A workflow engine for advanced data flows in medical image analysis. Frontiers in ICT 3, 15. doi:10.3389/fict.2016.00015.
  • Aerts (2016) Aerts, H.J.W.L., 2016. The potential of radiomic-based phenotyping in precision medicine. JAMA Oncology 2, 1636. doi:10.1001/jamaoncol.2016.2631.
  • Aerts et al. (2014) Aerts, H.J.W.L., Velazquez, E.R., Leijenaar, R.T.H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M.M., Leemans, C.R., Dekker, A., Quackenbush, J., Gillies, R.J., Lambin, P., 2014. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications 5, 4006. doi:10.1038/ncomms5006.
  • Afshar et al. (2019) Afshar, P., Mohammadi, A., Plataniotis, K.N., Oikonomou, A., Benali, H., 2019. From handcrafted to deep-learning-based cancer radiomics: Challenges and opportunities. IEEE Signal Processing Magazine 36, 132–160. doi:10.1109/msp.2019.2900993.
  • Angus et al. (2021) Angus, L., Starmans, M.P.A., Rajicic, A., Odink, A.E., Jalving, M., Niessen, W.J., Visser, J.J., Sleijfer, S., Klein, S., van der Veldt, A.A.M., 2021. The BRAF P.V600E mutation status of melanoma lung metastases cannot be discriminated on computed tomography by LIDC criteria nor radiomics using machine learning. Journal of Personalized Medicine 11, 257. doi:10.3390/jpm11040257.
  • Antonelli et al. (2021) Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., MSummers, R., van Ginneken, B., Bilello, M., Bilic, P., Christ, P.F., Do, R.K.G., Gollub, M.J., Heckers, S.H., Huisman, H., Jarnagin, W.R., McHugo, M.K., Napel, S., Pernicka, J.S.G., Rhode, K., Tobon-Gomez, C., Vorontsov, E., Huisman, H., Meakin, J.A., Ourselin, S., Wiesenfarth, M., Arbelaez, P., Bae, B., Chen, S., Daza, L., Feng, J., He, B., Isensee, F., Ji, Y., Jia, F., Kim, N., Kim, I., Merhof, D., Pai, A., Park, B., Perslev, M., Rezaiifar, R., Rippel, O., Sarasua, I., Shen, W., Son, J., Wachinger, C., Wang, L., Wang, Y., Xia, Y., Xu, D., Xu, Z., Zheng, Y., Simpson, A.L., Maier-Hein, L., Cardoso, M.J., 2021. The medical segmentation decathlon.
  • Apte et al. (2018) Apte, A.P., Iyer, A., Crispin-Ortuzar, M., Pandya, R., van Dijk, L.V., Spezi, E., Thor, M., Um, H., Veeraraghavan, H., Oh, J.H., Shukla-Dave, A., Deasy, J.O., 2018. Technical note: Extension of CERR for computational radiomics: A comprehensive MATLAB platform for reproducible radiomics research. Medical Physics 45, 3713–3720. doi:10.1002/mp.13046.
  • Avanzo et al. (2020) Avanzo, M., Wei, L., Stancanello, J., Vallières, M., Rao, A., Morin, O., Mattonen, S.A., Naqa, I.E., 2020. Machine and deep learning methods for radiomics. Medical Physics 47, e185–e202. doi:10.1002/mp.13678.
  • Bergstra and Bengio (2012) Bergstra, J., Bengio, Y., 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305.
  • Blazevic et al. (2021) Blazevic, A., Starmans, M.P.A., Brabander, T., Dwarkasing, R.S., van Gils, R.A.H., Hofland, J., Franssen, G.J.H., Feelders, R.A., Niessen, W.J., Klein, S., de Herder, W.W., 2021. Predicting symptomatic mesenteric mass in small intestinal neuroendocrine tumors using radiomics. Endocrine-Related Cancer 28, 529–539. doi:10.1530/erc-21-0064.
  • Bodalal et al. (2019) Bodalal, Z., Trebeschi, S., Nguyen-Kim, T.D.L., Schats, W., Beets-Tan, R., 2019. Radiogenomics: bridging imaging and genomics. Abdominal Radiology 44, 1960–1984. doi:10.1007/s00261-019-02028-w.
  • Bonab and Can (2019) Bonab, H., Can, F., 2019. Less is more: A comprehensive framework for the number of components of ensemble classifiers. IEEE Transactions on Neural Networks and Learning Systems 30, 2735–2745. doi:10.1109/tnnls.2018.2886341.
  • Bron et al. (2021) Bron, E.E., Klein, S., Papma, J.M., Jiskoot, L.C., Venkatraghavan, V., Linders, J., Aalten, P., De Deyn, P.P., Biessels, G.J., Claassen, J.A.H.R., Middelkoop, H.A.M., Smits, M., Niessen, W.J., van Swieten, J.C., van der Flier, W.M., Ramakers, I.H.G.B., van der Lugt, A., 2021. Cross-cohort generalizability of deep and conventional machine learning for MRI-based diagnosis and prediction of Alzheimer’s disease. NeuroImage: Clinical 31, 102712. doi:10.1016/j.nicl.2021.102712.
  • Bron et al. (2014) Bron, E.E., Steketee, R.M., Houston, G.C., Oliver, R.A., Achterberg, H.C., Loog, M., van Swieten, J.C., Hammers, A., Niessen, W.J., Smits, M., Klein, S., for the Alzheimer’s Disease Neuroimaging Initiative, 2014. Diagnostic classification of arterial spin labeling and structural MRI in presenile early stage dementia. Human Brain Mapping 35, 4916–4931. doi:10.1002/hbm.22522.
  • Buvat and Orlhac (2019) Buvat, I., Orlhac, F., 2019. The dark side of radiomics: On the paramount importance of publishing negative results. Journal of Nuclear Medicine 60, 1543–1544. doi:10.2967/jnumed.119.235325.
  • Castillo T et al. (2019) Castillo T, J.M., Starmans, M.P.A., Niessen, W.J., Schoots, I., Klein, S., Veenland, J.F., 2019. Classification of prostate cancer: High grade versus low grade using a radiomics approach, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Institute of Electrical and Electronics Engineers (IEEE). pp. 1319–1322. doi:10.1109/isbi.2019.8759217.
  • Chan and Ginsburg (2011) Chan, I.S., Ginsburg, G.S., 2011. Personalized medicine: Progress and promise. Annual Review of Genomics and Human Genetics 12, 217–244. doi:10.1146/annurev-genom-082410-101446.
  • Chen et al. (2015) Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., 2015. Xgboost: extreme gradient boosting. R package version 0.4-2 , 1–4.
  • Efron and Tibshirani (1986) Efron, B., Tibshirani, R., 1986. [Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy]: Rejoinder. Statistical Science 1, 54–75. doi:10.1214/ss/1177013817.
  • Elsken et al. (2018) Elsken, T., Metzen, J.H., Hutter, F., 2018. Neural architecture search: A survey. arXiv:1808.05377 .
  • Escalante et al. (2021) Escalante, H.J., Yao, Q., Tu, W.W., Pillay, N., Qu, R., Yu, Y., Houlsby, N., 2021. Guest editorial: Automated machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 2887–2890. doi:10.1109/TPAMI.2021.3077106.
  • Fang et al. (2014) Fang, Y.H.D., Lin, C.Y., Shih, M.J., Wang, H.M., Ho, T.Y., Liao, C.T., Yen, T.C., 2014. Development and evaluation of an open-source software package cgita for quantifying tumor heterogeneity with molecular images. doi:10.1155/2014/248505.
  • Feurer et al. (2019) Feurer, M., Klein, A., Eggensperger, K., Springenberg, J.T., Blum, M., Hutter, F., 2019. Auto-sklearn: Efficient and robust automated machine learning, in: Automated Machine Learning, Springer Science and Business Media LLC. pp. 113–134. doi:10.1007/978-3-030-05318-5_6.
  • Fonti and Belitser (2017) Fonti, V., Belitser, E.N., 2017. Feature selection using LASSO.
  • Frangi et al. (1998) Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A., 1998. Multiscale vessel enhancement filtering, in: Wells, W.M., Colchester, A., Delp, S. (Eds.), Medical Image Computing and Computer-Assisted Intervention — MICCAI’98, Springer Science and Business Media LLC. pp. 130–137. doi:10.1007/bfb0056195.
  • Freund and Schapire (1997) Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139. doi:10.1006/jcss.1997.1504.
  • van Griethuysen et al. (2017) van Griethuysen, J.J., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., Beets-Tan, R.G., Fillion-Robin, J.C., Pieper, S., Aerts, H.J., 2017. Computational radiomics system to decode the radiographic phenotype. Cancer Research 77, e104–e107. doi:10.1158/0008-5472.can-17-0339.
  • Guiot et al. (2021) Guiot, J., Vaidyanathan, A., Deprez, L., Zerka, F., Danthine, D., Frix, A.N., Lambin, P., Bottari, F., Tsoutzidis, N., Miraglio, B., Walsh, S., Vos, W., Hustinx, R., Ferreira, M., Lovinfosse, P., Leijenaar, R.T., 2021. A review in radiomics: Making personalized medicine a reality via routine imaging. Medicinal Research Reviews n/a, med.21846. doi:10.1002/med.21846.
  • Hamburg and Collins (2010) Hamburg, M.A., Collins, F.S., 2010. The path to personalized medicine. New England Journal of Medicine 363, 301–304. doi:10.1056/nejmp1006304.
  • Han et al. (2005) Han, H., Wang, W.Y., Mao, B.H., 2005. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: Huang, D.S., Zhang, X.P., Huang, G.B. (Eds.), Lecture Notes in Computer Science, Springer Science and Business Media LLC. pp. 878–887. doi:10.1007/11538059_91.
  • Hatt et al. (2018) Hatt, M., Vallieres, M., Visvikis, D., Zwanenburg, A., 2018. IBSI: an international community radiomics standardization initiative. Journal of Nuclear Medicine 59, 287.
  • He et al. (2008) He, H., Bai, Y., Garcia, E.A., Li, S., 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Institute of Electrical and Electronics Engineers (IEEE). pp. 1322–1328. doi:10.1109/ijcnn.2008.4633969.
  • Hood and Friend (2011) Hood, L., Friend, S.H., 2011. Predictive, personalized, preventive, participatory (P4) cancer medicine. Nature Reviews Clinical Oncology 8, 184–187. doi:10.1038/nrclinonc.2010.227.
  • Hosseini et al. (2020) Hosseini, M., Powell, M., Collins, J., Callahan-Flintoft, C., Jones, W., Bowman, H., Wyble, B., 2020. I tried a bunch of things: The dangers of unexpected overfitting in classification of brain data. Neuroscience & Biobehavioral Reviews 119, 456–467. doi:10.1016/j.neubiorev.2020.09.036.
  • Hutter et al. (2019) Hutter, F., Kotthoff, L., Vanschoren, J. (Eds.), 2019. Automated Machine Learning : Methods, Systems, Challenges. Springer Nature.
  • Isensee et al. (2021) Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H., 2021. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18, 203–211. doi:10.1038/s41592-020-01008-z.
  • Jack et al. (2015) Jack, C.R., Barnes, J., Bernstein, M.A., Borowski, B.J., Brewer, J., Clegg, S., Dale, A.M., Carmichael, O., Ching, C., DeCarli, C., Desikan, R.S., Fennema-Notestine, C., Fjell, A.M., Fletcher, E., Fox, N.C., Gunter, J., Gutman, B.A., Holland, D., Hua, X., Insel, P., Kantarci, K., Killiany, R.J., Krueger, G., Leung, K.K., Mackin, S., Maillard, P., Malone, I.B., Mattsson, N., McEvoy, L., Modat, M., Mueller, S., Nosheny, R., Ourselin, S., Schuff, N., Senjem, M.L., Simonson, A., Thompson, P.M., Rettmann, D., Vemuri, P., Walhovd, K., Zhao, Y., Zuk, S., Weiner, M., 2015. Magnetic resonance imaging in Alzheimer’s Disease Neuroimaging Initiative 2. Alzheimer’s & Dementia 11, 740–756. doi:10.1016/j.jalz.2015.05.002.
  • Kovesi (2003) Kovesi, P., 2003. Phase congruency detects corners and edges, in: The Australian Pattern Recognition Society Conference: DICTA.
  • Lambin et al. (2017) Lambin, P., Leijenaar, R.T., Deist, T.M., Peerlings, J., de Jong, E.E., van Timmeren, J., Sanduleanu, S., Larue, R.T., Even, A.J., Jochems, A., van Wijk, Y., Woodruff, H., van Soest, J., Lustberg, T., Roelofs, E., van Elmpt, W., Dekker, A., Mottaghy, F.M., Wildberger, J.E., Walsh, S., 2017. Radiomics: the bridge between medical imaging and personalized medicine. Nature Reviews Clinical Oncology 14, 749–762. doi:10.1038/nrclinonc.2017.141.
  • Lambin et al. (2012) Lambin, P., Rios-Velazquez, E., Leijenaar, R., Carvalho, S., van Stiphout, R.G., Granton, P., Zegers, C.M., Gillies, R., Boellard, R., Dekker, A., Aerts, H.J., 2012. Radiomics: Extracting more information from medical images using advanced feature analysis. European Journal of Cancer 48, 441–446. doi:10.1016/j.ejca.2011.11.036.
  • Lemaitre et al. (2017) Lemaitre, G., Nogueira, F., Aridas, C.K., 2017. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18.
  • Macskassy et al. (2005) Macskassy, S.A., Provost, F., Rosset, S., 2005. ROC confidence bands: An empirical evaluation, in: Proceedings of the 22nd international conference on Machine learning, ACM. pp. 537–544. doi:10.1145/1102351.1102419.
  • Mao et al. (2021) Mao, Y., Zhong, G., Wang, Y., Deng, Z., 2021. Differentiable light-weight architecture search. doi:10.1109/icme51207.2021.9428132.
  • Marcus et al. (2007) Marcus, D.S., Olsen, T.R., Ramaratnam, M., Buckner, R.L., 2007. The extensible neuroimaging archive toolkit. Neuroinformatics 5, 11–33. doi:10.1385/ni:5:1:11.
  • Nadeau and Bengio (2003) Nadeau, C., Bengio, Y., 2003. Inference for the generalization error. Machine Learning 52, 239–281. doi:10.1023/A:1024068626366.
  • Nioche et al. (2018) Nioche, C., Orlhac, F., Boughdad, S., Reuzé, S., Goya-Outi, J., Robert, C., Pellot-Barakat, C., Soussan, M., Frouin, F., Buvat, I., 2018. LIFEx: A freeware for radiomic feature calculation in multimodality imaging to accelerate advances in the characterization of tumor heterogeneity. Cancer Research 78, 4786–4789. doi:10.1158/0008-5472.can-18-0125.
  • O’Connor et al. (2017) O’Connor, J.P.B., Aboagye, E.O., Adams, J.E., Aerts, H.J.W.L., Barrington, S.F., Beer, A.J., Boellaard, R., Bohndiek, S.E., Brady, M., Brown, G., Buckley, D.L., Chenevert, T.L., Clarke, L.P., Collette, S., Cook, G.J., deSouza, N.M., Dickson, J.C., Dive, C., Evelhoch, J.L., Faivre-Finn, C., Gallagher, F.A., Gilbert, F.J., Gillies, R.J., Goh, V., Griffiths, J.R., Groves, A.M., Halligan, S., Harris, A.L., Hawkes, D.J., Hoekstra, O.S., Huang, E.P., Hutton, B.F., Jackson, E.F., Jayson, G.C., Jones, A., Koh, D.M., Lacombe, D., Lambin, P., Lassau, N., Leach, M.O., Lee, T.Y., Leen, E.L., Lewis, J.S., Liu, Y., Lythgoe, M.F., Manoharan, P., Maxwell, R.J., Miles, K.A., Morgan, B., Morris, S., Ng, T., Padhani, A.R., Parker, G.J.M., Partridge, M., Pathak, A.P., Peet, A.C., Punwani, S., Reynolds, A.R., Robinson, S.P., Shankar, L.K., Sharma, R.A., Soloviev, D., Stroobants, S., Sullivan, D.C., Taylor, S.A., Tofts, P.S., Tozer, G.M., van Herk, M., Walker-Samuel, S., Wason, J., Williams, K.J., Workman, P., Yankeelov, T.E., Brindle, K.M., McShane, L.M., Jackson, A., Waterton, J.C., 2017. Imaging biomarker roadmap for cancer studies. Nature Reviews Clinical Oncology 14, 169–186. doi:10.1038/nrclinonc.2016.162.
  • Ojala et al. (2002) Ojala, T., Pietikainen, M., Maenpaa, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 971–987. doi:10.1109/tpami.2002.1017623.
  • Olson and Moore (2019) Olson, R.S., Moore, J.H., 2019. TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning. Springer Science and Business Media LLC. pp. 151–160. doi:10.1007/978-3-030-05318-5_8.
  • Open Science Collaboration (2015) Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science 349, aac4716. doi:10.1126/science.aac4716.
  • Parekh and Jacobs (2016) Parekh, V., Jacobs, M.A., 2016. Radiomics: a new application from established techniques. Expert Review of Precision Medicine and Drug Development 1, 207–226. doi:10.1080/23808993.2016.1164013.
  • Parekh and Jacobs (2019) Parekh, V.S., Jacobs, M.A., 2019. Deep learning and radiomics in precision medicine. Expert Review of Precision Medicine and Drug Development 4, 59–72. doi:10.1080/23808993.2019.1585805.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12, 2825–2830.
  • Pfaehler et al. (2019) Pfaehler, E., Zwanenburg, A., de Jong, J.R., Boellaard, R., 2019. RaCaT: An open source and easy to use radiomics calculator tool. PLOS ONE 14, e0212223. doi:10.1371/journal.pone.0212223.
  • Picard and Cook (1984) Picard, R.R., Cook, R.D., 1984. Cross-validation of regression models. Journal of the American Statistical Association 79, 575–583. doi:10.1080/01621459.1984.10478083.
  • Rathore et al. (2018) Rathore, S., Bakas, S., Pati, S., Akbari, H., Kalarot, R., Sridharan, P., Rozycki, M., Bergman, M., Tunc, B., Verma, R., Bilello, M., Davatzikos, C., 2018. Brain cancer imaging phenomics toolkit (brain-CaPTk): An interactive platform for quantitative analysis of glioblastoma. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries 10670, 133–145. doi:10.1007/978-3-319-75238-9_12.
  • Rizzo et al. (2018) Rizzo, S., Botta, F., Raimondi, S., Origgi, D., Fanciullo, C., Morganti, A.G., Bellomi, M., 2018. Radiomics: the facts and the challenges of image analysis. European Radiology Experimental 2, 36. doi:10.1186/s41747-018-0068-z.
  • dos Santos et al. (2021) dos Santos, D.P., Dietzel, M., Baessler, B., 2021. A decade of radiomics research: are images really data or just patterns in the noise? European Radiology 31, 1–4. doi:10.1007/s00330-020-07108-w.
  • Sollini et al. (2019) Sollini, M., Antunovic, L., Chiti, A., Kirienko, M., 2019. Towards clinical application of image mining: a systematic review on artificial intelligence and radiomics. European Journal of Nuclear Medicine and Molecular Imaging 46, 2656–2672. doi:10.1007/s00259-019-04372-x.
  • Song et al. (2020) Song, J., Yin, Y., Wang, H., Chang, Z., Liu, Z., Cui, L., 2020. A review of original articles published in the emerging field of radiomics. European Journal of Radiology 127, 108991. doi:10.1016/j.ejrad.2020.108991.
  • Starmans (2021) Starmans, M.P.A., 2021. WORCDatabase. GitHub repository.
  • Starmans et al. (2021a) Starmans, M.P.A., Buisman, F.E., Renckens, M., Willemssen, F.E.J.A., Van der Voort, S.R., Groot Koerkamp, B., Grünhagen, D.J., Niessen, W.J., Vermeulen, P.B., Verhoef, C., Visser, J.J., Klein, S., 2021a. Distinguishing pure histopathological growth patterns of colorectal liver metastases on CT using deep learning and radiomics: a pilot study. In revision.
  • Starmans et al. (2020a) Starmans, M.P.A., Els, C.J., Fiduzi, F., Niessen, W.J., Klein, S., Dwarkasing, R.S., 2020a. Radiomics model to predict hepatocellular carcinoma on liver MRI of high-risk patients in surveillance: a proof-of-concept study, in: European Congress of Radiology (ECR) 2020 Book of Abstracts, p. 34. doi:10.1186/s13244-020-00851-0.
  • Starmans et al. (2021b) Starmans, M.P.A., Miclea, R.L., Vilgrain, V., Ronot, M., Purcell, Y., Verbeek, J., Niessen, W.J., Ijzermans, J.N.M., de Man, R.A., Doukas, M., Klein, S., Thomeer, M.G., 2021b. Automated differentiation of malignant and benign primary solid liver lesions on MRI: an externally validated radiomics model. medRxiv doi:10.1101/2021.08.10.21261827.
  • Starmans et al. (2021c) Starmans, M.P.A., Timbergen, M.J.M., Vos, M., Padmos, G.A., Grünhagen, D.J., Verhoef, C., Sleijfer, S., van Leenders, G.J.L.H., Buisman, F.E., Willemssen, F.E.J.A., GrootKoerkamp, B., Angus, L., van der Veldt, A.A.M., Rajicic, A., Odink, A.E., Renckens, M., Doukas, M., de Man, R., Ijzermans, J.N.M., Miclea, R.L., Vermeulen, P.B., Thomeer, M.G., Visser, J.J., Niessen, W.J., Klein, S., 2021c. The WORC database: MRI and CT scans, segmentations, and clinical labels for 932 patients from six radiomics studies. Data in Brief Co-Submission.
  • Starmans et al. (2020b) Starmans, M.P.A., Timbergen, M.J.M., Vos, M., Renckens, M., Grünhagen, D.J., van Leenders, G.J.L.H., Dwarkasing, R.S., Willemssen, F.E.J.A., Niessen, W.J., Verhoef, C., Sleijfer, S., Visser, J.J., Klein, S., 2020b. Differential diagnosis and molecular stratification of gastrointestinal stromal tumors on CT images using a radiomics approach. arXiv:2010.06824.
  • Starmans et al. (2020c) Starmans, M.P.A., van der Voort, S.R., Castillo T, J.M., Veenland, J.F., Klein, S., Niessen, W.J., 2020c. Radiomics: Data mining using quantitative medical image features. Academic Press. chapter 18. pp. 429–456. doi:10.1016/B978-0-12-816176-0.00023-5.
  • Starmans et al. (2018a) Starmans, M.P.A., Van der Voort, S.R., Phil, T., Klein, S., 2018a. Workflow for optimal radiomics classification (WORC).
  • Starmans et al. (2018b) Starmans, M.P.A., Van der Voort, S.R., Phil, T., Klein, S., 2018b. Workflow for optimal radiomics classification (WORC) documentation.
  • Su et al. (2019) Su, X., Chen, N., Sun, H., Liu, Y., Yang, X., Wang, W., Zhang, S., Tan, Q., Su, J., Gong, Q., Yue, Q., 2019. Automated machine learning based on radiomics features predicts H3 K27M mutation in midline gliomas of the brain. Neuro-Oncology. doi:10.1093/neuonc/noz184.
  • Sun et al. (2019) Sun, H., Qu, H., Chen, L., Wang, W., Liao, Y., Zou, L., Zhou, Z., Wang, X., Zhou, S., 2019. Identification of suspicious invasive placentation based on clinical mri data using textural features and automated machine learning. European Radiology 29, 6152–6162. doi:10.1007/s00330-019-06372-9.
  • Szczypiński et al. (2009) Szczypiński, P.M., Strzelecki, M., Materka, A., Klepaczko, A., 2009. Mazda—a software package for image texture analysis. Computer Methods and Programs in Biomedicine 94, 66–76. doi:10.1016/j.cmpb.2008.08.005.
  • Tharwat (2021) Tharwat, A., 2021. Classification assessment methods. Applied Computing and Informatics 17, 168–192. doi:10.1016/j.aci.2018.08.003.
  • Thornton et al. (2013) Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K., 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. pp. 847–855. doi:10.1145/2487575.2487629.
  • Timbergen et al. (2020) Timbergen, M.J.M., Starmans, M.P.A., Padmos, G.A., Grünhagen, D.J., van Leenders, G.J.L.H., Hanff, D., Verhoef, C., Niessen, W.J., Sleijfer, S., Klein, S., Visser, J.J., 2020. Differential diagnosis and mutation stratification of desmoid-type fibromatosis on MRI using radiomics. European Journal of Radiology 131, 109266. doi:10.1016/j.ejrad.2020.109266.
  • Traverso et al. (2018) Traverso, A., Wee, L., Dekker, A., Gillies, R., 2018. Repeatability and reproducibility of radiomic features: A systematic review. International Journal of Radiation Oncology*Biology*Physics 102, 1143–1158. doi:10.1016/j.ijrobp.2018.05.053.
  • Urbanowicz et al. (2018) Urbanowicz, R.J., Olson, R.S., Schmitt, P., Meeker, M., Moore, J.H., 2018. Benchmarking relief-based feature selection methods for bioinformatics data mining. Journal of Biomedical Informatics 85, 168–188. doi:10.1016/j.jbi.2018.07.015.
  • van der Voort et al. (2019a) van der Voort, S., Incekara, F., Wijnenga, M., Kapas, G., Gardeniers, M., Schouten, J., Starmans, M., NandoeTewarie, R., Lycklama, G., French, P., Dubbink, H., van den Bent, M., Vincent, A., Niessen, W., Klein, S., Smits, M., 2019a. Data belonging to predicting the 1p/19q co-deletion status of presumed low grade glioma with an externally validated machine learning algorithm. doi:10.17632/rssf5nxxby.1.
  • van der Voort et al. (2019b) van der Voort, S.R., Incekara, F., Wijnenga, M.M., Kapas, G., Gardeniers, M., Schouten, J.W., Starmans, M.P., Tewarie, R.N., Lycklama, G.J., French, P.J., Dubbink, H.J., van den Bent, M.J., Vincent, A.J., Niessen, W.J., Klein, S., Smits, M., 2019b. Predicting the 1p/19q codeletion status of presumed low-grade glioma with an externally validated machine learning algorithm. Clinical Cancer Research 25, 7455–7462. doi:10.1158/1078-0432.ccr-19-1127.
  • van der Voort and Starmans (2018) van der Voort, S.R., Starmans, M.P.A., 2018. Predict: a radiomics extensive digital interchangable classification toolkit (PREDICT).
  • Vos et al. (2019) Vos, M., Starmans, M.P.A., Timbergen, M.J.M., van der Voort, S.R., Padmos, G.A., Kessels, W., Niessen, W.J., van Leenders, G.J.L.H., Grünhagen, D.J., Sleijfer, S., Verhoef, C., Klein, S., Visser, J.J., 2019. Radiomics approach to distinguish between well differentiated liposarcomas and lipomas on MRI. British Journal of Surgery 106, 1800–1809. doi:10.1002/bjs.11410.
  • Yang et al. (2020) Yang, B., Zhong, J., Zhong, J., Ma, L., Li, A., Ji, H., Zhou, C., Duan, S., Wang, Q., Zhu, C., Tian, J., Zhang, L., Wang, F., Zhu, H., Lu, G., 2020. Development and validation of a radiomics nomogram based on 18F-Fluorodeoxyglucose positron emission tomography/computed tomography and clinicopathological factors to predict the survival outcomes of patients with non-small cell lung cancer. Frontiers in Oncology 10, 1042. doi:10.3389/fonc.2020.01042.
  • Yip and Aerts (2016) Yip, S.S.F., Aerts, H.J.W.L., 2016. Applications and limitations of radiomics. Physics in Medicine and Biology 61, R150–R166. doi:10.1088/0031-9155/61/13/r150.
  • Yoo et al. (2003) Yoo, A.B., Jette, M.A., Grondona, M., 2003. SLURM: Simple linux utility for resource management. Job Scheduling Strategies for Parallel Processing 2862, 44–60. doi:10.1007/10968987_3.
  • Yu et al. (2019) Yu, F.H., Wang, J.X., Ye, X.H., Deng, J., Hang, J., Yang, B., 2019. Ultrasound-based radiomics nomogram: A potential biomarker to predict axillary lymph node metastasis in early-stage invasive breast cancer. European Journal of Radiology 119, 108658. doi:10.1016/j.ejrad.2019.108658.
  • Zhang and Ma (2012) Zhang, C., Ma, Y. (Eds.), 2012. Ensemble Machine Learning. Springer Science and Business Media LLC, New York. doi:10.1007/978-1-4419-9326-7.
  • Zhang et al. (2015) Zhang, L., Fried, D.V., Fave, X.J., Hunter, L.A., Yang, J., Court, L.E., 2015. ibex: An open infrastructure software platform to facilitate collaborative work in radiomics. Medical Physics 42, 1341–1353. doi:10.1118/1.4908210.
  • Zwanenburg et al. (2020) Zwanenburg, A., Vallières, M., Abdalah, M., Aerts, H., Andrearczyk, V., Apte, A., Ashrafinia, S., Bakas, S., Beukinga, R., Boellaard, R., Bogowicz, M., Boldrini, L., Buvat, I., Cook, G., Davatzikos, C., Depeursinge, A., Desseroit, M.C., Dinapoli, N., Dinh, C., Löck, S., 2020. The image biomarker standardization initiative: Standardized quantitative radiomics for high-throughput image-based phenotyping. Radiology 295, 191145. doi:10.1148/radiol.2020191145.