Radiomics, has recently been the focus of a lot of interest and research as a method of extracting more information from clinical imaging and therefore help solve different clinical problems. With the advances in image processing and data science, medical images (computed tomography, magnetic resonance, positron emission tomography images, etc.) are converted to minable high-dimensional data, which may carry latent information that can facilitate clinical decision-making. In general, a radiomics pipeline consists of 4 steps: image acquisition, tumor segmentation, radiomics feature extraction and statistical/machine learning-based analysis[Court2016]. A number of studies have shown that any of these 4 steps can cause the variability of radiomics features [Lee2019, Monica2018, Bianchi2019, Foy2018, Shafig2017, Bogowicz2016, Kumar2012, Schwier2019]. For example, currently, different imaging centers use varying image acquisition protocols. The lack of consensus or guidelines on image acquisition across institutions leads to imaging data heterogeneity among different studies [Mackin2015]. The accuracy of manual segmentation also affects the robustness and stability of radiomics models, however it is a viable solution in scenarios of contouring highly varying complex tumor boundaries with low tumor-tissue contrast using automated methods [vanVelden2016].
While researchers have recognized the relative lack of consistency and reproducibility, there is a dearth of approaches to address this issue. Many research institutions have developed their in-house or open-source programs for radiomic analysis. These programs usually run different implementations of the same features but with different terminologies. In some cases, the same metrics may not generate identical values due to differences in implementing the underlying algorithms e.g. image intensity normalization based on just the tumor region or on entire image may generate different results.
To address the standardization of radiomics features, the image biomarker standardization initiative (IBSI), an international collaboration, was founded [Zwanenburg2020]. The IBSI have fully propose the definitions of 11 commonly used feature classes. A digital phantom and a CT phantom, with their benchmark values of features, have been created for the purpose of standardization.
Recent analyses have been conducted on datasets for evaluating the variation of features among radiomics software programs, showing that large variations were found due to image preprocessing, parameters and algorithm implementations [Monica2018, Bianchi2019, Foy2018]. As these studies focused on specific private datasets and limited number of features, the results are not generalizable. In this study, the issue of generalization was addressed by using the IBSI benchmarking to generate reliable results.
2 Materials and Methods
2.1 Digital Phantom
The digital phantom developed by the IBSI is geometrically small, consisting of 5 4 4 voxels. Fig. 1 shows four slices of the phantom. Grey levels are designed as integers and hence, discretization is not required before computing features. Grey levels range from 1 to 9 in the whole volume and from 1 to 6 in the region of interest (ROI). Level 2 and 5 are absent. Voxels are isotropic, with 2.0 2.0 2.0 mm spacing for the three dimentions.
2.2 IBSI-standardized Features
To comprehensively compare inter-software differences, 11 classes of radiomics features were evaluated in this study: morphology, local intensity, intensity-based statistic, intensity histogram, intensity-volume histogram, grey level co-occurrence matrix (GLCM), grey level run length matrix (GLRLM), grey level size zone matrix (GLSZM), grey level distance zone matrix (GLDZM), neighborhood grey tone difference matrix (NGTDM) and neighboring grey level dependence matrix (NGLDM) based features. These features are standardized and summarized from various technical references in the IBSI. According to their mathematical definitions, the 11 classes of radiomics features characterize various properties of ROI, and they can be grouped into 3 major categories: 1) statistical/histogram, 2) texture, and 3) morphology features (Table 1).
shows the number of features within the IBSI standards that fall within each of the aforementioned categories. While the IBSI has a specified terminology, similar features may be named differently in different software programs. As an example, GLCM joint maximum is tagged as maximum probability in some programs e.g. Pyradiomics[van_Griethuysene2017], QIFE [Echegaray2018] or studies [Aerts2014, Wu2016, COROLLER2015345]. In some software programs [van_Griethuysene2017]
, kurtosis is computed with the formula of excess kurtosis, which is an alternative definition of kurtosis. While excess kurtosis is equal to kurtosis – 3, to perform comparisons to IBSI standards the values of kurtosis must be corrected manually after running these software programs. To conduct a fair comparison of metrics across the different software, we created a glossary of matching radiomic metrics, with their different terminologies, across different radiomics software (Supplementary 1).
2.3 Radiomics Software Programs
Six publicly available software programs: Pyradiomics [van_Griethuysene2017, Dou2018, vanGriethuysen2020], the Medical Imaging Interaction Toolkit (MITK) [GOTZ2019108, Kickingereder5765, Kickingereder2016], LIFEx [Nioche, Nioche01052017, Nardone2018], the Standardized Environment for Radiomics Analysis (SERA) [Ashrafinia2019, Ashrafinia01052018, Du2020], Cancer Imaging Phenomics Toolkit (CaPTk) [Davatzikos2018, Pati2020, Rathore2018], a MATLAB library from McGill University (A2, GitHub Repo: radiomics-develop) [Zwanenburg2020, Valli_res_2015, Vallieres2017] and an in-house MATLAB library (A1, University of Southern California [Hassani2019, Varghese2019, Varghese20192]) are evaluated here (Table 3). These programs were included in our study as they are widely used in radiomics research and span across different software development languages and operating systems.
While the underlying algorithm remains the same, its application differs across different software (Table 4). For example, for morphology features, researchers commonly use Marching Cubes algorithms for meshing presentation, as it works efficiently on multiple programming platforms [Lorensen1987]. However, more efficient implementations were proposed by other researchers [NEWMAN2006854]. In fact, the IBSI also recommend the methodology proposed by Lewiner [Lewiner2003], optimizing the performance and feasibility. Also, intensity-based statistics and histogram features are already well defined in common image processing field and therefore, their computing results should mainly rely on relatively established programming strategies. For example, for textural features, all software basically cite the same publications, where GLCM was proposed by Haralick [Haralick1973], GLRLM was proposed by Galloway [GALLOWAY1975172], GLSZM was proposed by Thibault [Thibault2009, Thibault2014], GLDZM was derived from Thibault [Thibault2014], NGTDM was proposed by Amadasun [Amadasun1989] and NGLDM was proposed by Sun [SUN1983341].
The level of quantitative assessment provided by different radiomic software is different, i.e., not all software programs includes all 173 IBSI-standardized features. Table 1 shows the number of features computed by each software. To better understand the distribution of the 173 different IBSI-defined features across the different radiomics software, an UpSet diagram and Venn diagram (Fig. 2) approach was used. Using this approach, the “popularity” of the 3 major categories of features were analyzed, i.e., how many features were shared by different radiomics software programs was identified. Only the 6 publicly available software programs were considered in the diagrams for the reason that they may be designed for general radiomics analysis, while in-house software may be designed for application-specific scenarios. As the total number of features of each category are different across and within software, two ad hoc metrics to quantify the “popularity” of the each of the 3-feature category, based on the 6 software were defined:
where indicates the number of software that support the features, indicates the number of features. Features supported in more than 4 out of 6 software programs are considered as ”popular”. These are highlighted in the UpSet diagram in Fig. 2. Intuitively, quantifies the weighted cardinality of the specific group of radiomics features; quantifies the percentage of ”popular” features in the specific group. Large values in both metrics indicate that radiomics features are widely supported by various software (Fig. 3).
2.4 Computational Workflow
In general, the computational workflow can be divided into two stages: 1) preprocessing and 2) feature computation (Fig. (a)a
). Two preprocessing operations are included in most studies and introduced in the IBSI: voxel interpolation and intensity discretization.
As texture features are sensitive to voxel size variation, interpolation is required. Different algorithms for interpolation, such as nearest neighbor, bilinear, B-Spline interpolation are used. Intensity discretization is useful for producing a reasonable quantitative analysis and removing noise. Two approaches are defined in the IBSI: 1) fixed bin number and fixed bin width [Zwanenburg2020]:
where denotes the discrete intensity for voxel , denotes the number of bins, denotes the original gray level of voxel , and denotes the minimum and maximum gray level in the ROI, and denotes the interval width of each bin. Number of bins and width of bins need to be specify for fixed bin number and fixed bin width, respectively.
As the digital phantom was developed using discretized intensities and isotropic voxels, no preprocessing is needed. While most of the radiomics software programs allow modular computing, LIFEx and CaPTk are rigid and preprocessing cannot be skipped. To address this problem, specific parameters of voxel interpolation and intensity discretization are assigned to remove their respective contribution. Expected spacing size is needed to specify for interpolation. Spacing size 2.0 2.0 2.0 mm, which is identical to the original spacing, is set for voxel interpolation because we expect the spacing not to change. For intensity discretization, LIFEx use fixed bin width approach with equal to 1 and CaPTk use fixed bin number approach with equal to 6. Similar to the setting for interpolation, we expect both of these settings for discretization do not change voxel intensities. In this study, five basic statistic features are chosen to compare mean, median, minimum, maximum discretized intensity and discretized intensity range across the different software, after intensity discretization to reveal basic properties of intensity distribution which has been used to evaluate the different software.
Feature computation is the next step after either voxel interpolation or intensity discretization (Fig. (a)a). For non-morphology features, the feature values are directly calculated from the 3D volume data. Morphology features, however, by definition, are defined under polygon meshing representation, leading to a two-stage computation workflow (Fig. (b)b). Image volume data is first transformed to meshing representation, then morphology features are calculated based on it.
Table 5 showsshows the parameters in each step of computation. To minimize the effect of different parameters and generalize the results, parameter settings are designed to maintain the highest consistency. For GLCM, GLRLM, GLSZM, GLDZM, NGTDM and NGLDM features, voxel neighborhoods are measured along all possible direction/angle in 3-Dimensional (3D) space, which mean that a central voxel has 26 neighbors. The distance between the central voxel and the neighbor voxel is set as 1 by default. As each direction can generate one GLCM and GLRLM by definition, feature aggregation should be used to summarize feature values from all directions. In general, the information of features from multiple directions are summarized in two different ways. LIFEx averages the features over 26 directions, while the other software programs compute the features after merging matrices from all directions.
As the voxel interpolation and intensity discretization are fixed in CaPTk and LIFEx, we compare the agreement of these 5 features first to confirm that preprocessing does not change the voxel intensities (Table 6). Values of mean, median, minimum and maximum intensity calculated by CaPTk are 1 unit less than the benchmark values, while the intensity range is equal. This indicates that CaPTk set the minimum grey level as 0 instead of 1 and all voxel intensities are shifted by -1. LIFEx does not include the 5 features, hence effect of preprocessing is unclear.
Software agreement for each single feature is measured by relative difference between the calculated feature values and the benchmark values provided by the IBSI (Eq. 5). Significant differences in overall features are noted from CaPTk analysis (Fig. 5), possibly from the effect of intensity shifting in the step of preprocessing. Good agreement is found between the other 6 software programs.
To study the effect of intensity shifting, we modified the code for intensity shifting in our in house pipeline and produced the results of GLCM as an example. Results based on the intensity shifting in our in house pipeline and producing the results of GLCM as an example, revealed that that some features (e.g. joint average, joint entropy, sum average and inverse difference moment) have significant deviation after adding the intensity shifting, while others (e.g. contrast, sum entropy) are unchanged (Fig.6).
A scatter plot was used to analyze the agreement (measured by relative difference) among different feature classes (Fig. 7). Since CaPTk preprocessed the intensity by shifting -1 unit, which is different to all other software, we considered all the other software programs in Fig. 7. Good agreement is also observed among non-morphology features, while poor agreement is observed for morphology features
As morphology features are computed under the meshing representation, the computed meshes may also affect feature values. Among 7 software programs, we extracted the computed meshes from SERA and A2 at their intermediate steps (Fig. 8). Other software programs are not able to give the meshes data, either because they are programmed as “end-to-end” pipelines or are not open-source. The meshing of SERA slightly differs from the one of A2. Values of volumes of them are also different (556 for SERA, 554 for A2).
This study has demonstrated that the design and implementation of radiomics feature calculation vary across multiple software programs, using a single IBSI standard phantom. While in other early studies [Monica2018, Bianchi2019, Foy2018]
, the comparison and analyses of radiomics reproducibility, software agreement were reported using statistical hypothesis testing[Monica2018, Foy2018], spearman correlation coefficients [Bianchi2019] or intraclass-correlation coefficients [Foy2018] (ICC), using a set of features from datasets.
Different strategies of intensity discretization have an influence on the resulted feature values. As shown in Fig. 6, two strategies (minimum intensity as 0 vs. minimum intensity as 1) produce different feature values. The reason for different values is that 0 is introduced in the computation, thus partial terms will be lost. Given the two different strategies, as an example, GLCM Joint Entropy will have two different mathematical definitions:
where denotes the element of GLCM at the location of row and column. According to the definition of GLCM, both cases will result in identical , but the second-order statistical features may be affected because zeros are introduced in the calculation. Partial terms of co-occurrence will be lost in the multiplication operation with 0, causing relatively smaller feature values. This discretization strategy is not unique to just CaPTk; other studies also use the same mathematical definition, setting the minimum intensity to zero [Mohanaiah2013ImageTF, Humeau2019, Gade2018]. By definition, the two discretization approaches given by the IBSI (Eq. 3 and Eq. 4), add a constant unit of one to the minimum intensity avoid zeros encountered in the quantized image. Discretization should therefore adhere to the IBSI standard not only for reproducibility, but also for retaining the partial terms in the computation of the co-occurrence matrix to fully and precisely characterize the properties of ROI.
Disagreement among feature values also occurred frequently in the group of morphological features. By definition, the computation of morphology features is more complex than non-morphology features. Statistical and textural features are defined unequivocally on the 3D volume of ROI; hence the computation goes directly using matrix algebra. The computation of morphology features, by contrast, includes two stages. The source of feature value variation may come from the first stage, where volume data is transformed as meshes data. There are various algorithms for meshing approximation, where different algorithms may result in different meshes and therefore produce different feature values. As Fig. 8 shown, SERA and A2 may use different algorithms and give two different approximations and values of volumes. Therefore, careful consideration needs to be given to the choice of morphometric features used particularly in multi-center radiomics analysis.
The “popularity” of each features is important for analysis of software agreement. With more software giving a same feature, the analysis of it is more reliable. Within 173 IBSI-standardized features, Fig. 2 shows that the 6 publicly available software programs share 1/29 morphology feature, 7/50 statistics and histogram features and 21/94 texture features. Furthermore, both ad hoc metrics, and , have lower values for morphology features (Fig. 3). These indicate that relatively less attention is provided to morphology properties of an ROI compared to its non-morphological properties. This may be true considering the computational complexity behind extracting morphology features, i.e., the algorithms and programming of meshing representations are usually challenging, tedious and error prone [Lorensen1987, Chernikov2014, DELIBASIS2001343, RAJON2003411]. Taken together these reasons justify the poor representation and feature inconsistency of morphological metrics within radiomics software.
The primary aim of most radiomics studies is to identify significant features and build discriminant models. Therefore, it is desirable to improve the model performance by including more candidate features in the statistical analysis. With more features, in general, it is more possible to find a better subset of radiomics features to improve the performance, achieving higher accuracy, area under the curve to aid discrimination. In addition, morphology features show significant prognostic power in the application of various cancer types [Parmar2015, CUOCOLO2019144, Zhu2015]. Therefore, despite the computational complexity in extracting morphology features which are rich in information, possibly distinct from that provided by non-morphological metrics, they should be included in analysis. The results reveal that current software programs lack standardization. Benchmarking may be important for a reproducible radiomics study. The IBSI digital phantom is a good resource for validating programs with the provided benchmark values for each proposed feature. In most studies, images are analyzed with preprocessing (normalization, transformation, discretization). While the phantom is not be able to contribute to the benchmarking of preprocessing directly because it does not require additional processing, it may be helpful indirectly by checking the basic features in Table 6, where the difference of discretization approach are found. The intensity-based statistical features are calculated right after interpolation, which can be used for benchmarking of image interpolation; intensity histogram features are calculated right after gray level discretization, which can be used for benchmarking of both interpolation and discretization (Fig. (a)a).
We conclude that future radiomic studies should provide implementation details about the radiomics software such as the name of the software, its version, whether it is open-source or custom-built, the programming language used to develop it etc., so that appropriate comparisons of the results across different radiomics studies can be performed. In addition, consistent terminology of the features should be used, ideally adhering to the IBSI terminologies. Features not used in the IBSI list should be thoroughly defined and supported by references, so future studies can validate them.
This study evaluated only 6 publicly available and 1 in-house program. Adding more software programs may strengthen our findings. For some programs, computation and code running tracking may not available, leading to the difficulty in matching the program available to the program used to publish results. While some programs are readable (e.g. MATLAB toolbox), other programs only provide graphical user interface (GUI) for the convenience of non-professional users and codes are not visible. As an example, LIFEx is programmed in Java, and the source and reason of features deviation is not easy to understand.
Further analysis could focus on the feature comparison on real-world datasets. With a larger range of gray levels in the real cases, different imaging modalities, the features agreement may be different. Most studies employ image enhancement, augmentation, and transformation before feature extraction to find more significant features, such as image filtering, edge enhancement. These operations may also affect the degree of variation across radiomics programs.
We evaluated feature agreement in the use of 7 different radiomics software programs. While most first-order and second-order texture features show satisfactory agreement, morphological features show significant variation. In addition, most programs are relatively weak on morphology analysis possibly owing to its computation complexity. Further work is necessary to standardize the calculation of radiomics features across different radiomics software, which is one of the first steps towards the clinical translation of radiomic analysis.
|Statistical/histogram features||Texture features||Morphology (shape) features|
|Local intensity||GLCM||Morphology features|
|Local intensity features||2||0||0||2||0||2||0||1|
|Intensity-based statistical features||18||6||15||18||5||18||15||18|
|Intensity histogram features||23||0||2||21||2||19||17||23|
|Intensity-volume histogram features||7||0||0||7||0||7||0||7|
|Language||Supported Operating System||Significant Dependent Library|
|A1||MATLAB||Windows, Linux, MacOS||None|
|Pyradiomics||Python||Windows, Linux, MacOS||SimpleITK|
|MITK||C++||Windows, Linux, MacOS||the Insight Toolkit (ITK), the Visualization Toolkit (VTK)|
|LIFEx||Java||Windows, Linux, MacOS||None|
|SERA||MATLAB||Windows, Linux, MacOS||None|
|CaPTk||C++||Windows, Linux, MacOS||OpenCV, ITK, VTK|
|A2||MATLAB||Windows, Linux, MacOS||None|
|Local intensity features||-||-||-||-||-||-||IBSI[Zwanenburg2020]||Wahl[Wahl2009]|
|Intensity-based statistical features||-||-||-||-||IBSI[Zwanenburg2020]||-||IBSI[Zwanenburg2020]||-|
|Intensity histogram features||-||-||-||-||-||Macyszyn[Macyszyn2015]||IBSI[Zwanenburg2020]||-|
|Intensity-volume histogram features||-||-||-||-||IBSI[Zwanenburg2020]||-||IBSI[Zwanenburg2020]||El Naqa[ELNAQA20091162]|
|Interpolation||-||-||-||2.0 2.0 2.0 mm||-||2.0 2.0 2.0 mm||-|
|Discretization||-||-||-||Fix bin size of 1.0||-||Fix bin number of 6||-|
|A1||Pyradiomics||MITK||LIFEx||SERA||CaPTk||A2||IBSI Benchmark Value|
|Mean discretised intensity||-||-||2.1643||-||2.1486||1.1486||2.1486||2.15|
|Median discretised intensity||-||-||1.0156||-||1||0||1||1|
|Minimum discretised intensity||-||-||1||-||1||0||1||1|
|Maximum discretised intensity||-||-||6||-||6||5||6||6|
|Discretised intensity range||-||-||5||-||5||5||5||5|