Quantitative medical image computing, known as radiomics, has been an emerging field in medical image analysis. The goal of radiomics is to extract hundreds of features that are mathematical summarizations of the volume-of-interest (VOI) from computed tomography (CT), magnetic resonance imaging (MRI) or positron emission tomography (PET), for the purpose of disease characterization and patient stratification. Imaging biomarkers built on radiomic features have demonstrated great performance in predicting patients outcome [6, 2, 8]. However, there are two critical issues in radiomics analysis.
First, there is always the possibility of overfitting. Building statistically powerful models at medically meaningful effect sizes using hundreds of radiomic features would at least require thousands of samples 
, while such datasets are rarely available for research in medical settings. Even with generalization techniques such as feature reduction and cross-validation, overfitting remains a primary concern.
Second, conventional radiomics analysis does not necessarily consider the heterogeneity of the disease of interest. Breast cancer imaging subtypes identified using a radiomics approach are distinct from established breast cancer molecular and pathological subtypes 
. This suggests that tumors in different subtypes or genotypes may have similar appearances. However, radiomic models are usually trained with either linear or deterministic models such as logistic regression or support vector machines for subtype prediction. These approaches are not ideal for the modeling of overlapping distributions.
Unsupervised clustering can be leveraged to alleviate these problems. Consensus clustering (CC), for example, has been used with radiomic features extracted from the tumor and surrounding parenchyma to define imaging subtypes in breast cancer[9, 13]. However, it is difficult to glean further insights of the imaging subtypes through clustering without additional analyses, and often non-trivial to select the optimal number of clusters.
In this paper, we alleviated the issues of overfitting and non-distinct feature distributions using a radiomics pipeline that identifies imaging subtypes based on radiomic features using a probabilistic generative model with minimum supervision. We made use of an autoencoder to reduce the dimensions of our radiomic features and find representations with minimal correlations. We incorporated a Gaussian mixture model (GMM) with minimum message length criterion (MML) to cluster patient MRIs into imaging subtypes using the learned representations of features . We compared the performance of other clustering algorithms with our approach and investigated the clinical value of the defined imaging subtypes. In addition, we demonstrated that the imaging subtypes derived from our pipeline have comparable prognostic ability to state-of-the-art clinical and imaging biomarkers.
The workflow of our unsupervised clustering radiomics analysis includes five steps. (Fig. 1). First, quantitative imaging features are extracted from MRI scans. Then an autoencoder is applied to learn feature representations for the purpose of dimension reduction. Next, the learned representations from the latent space are clustered using a GMM. Finally, the significance and clinical value of the learned imaging subtypes (clusters) are evaluated. Clinical outcome is held out during model training and only used in evaluation stage.
2.0.1 Radiomic Feature Extraction
The first step of our pipeline was MRI tumour lesion segmentation and radiomic features extraction. The MRIs were resampled to isotropic spacing [3,3,3] using B-spline transformation. A Z-score transformation was applied to normalize the resampled images and rescale their intensities to range from 0 to 100. Outliers with extreme intensity values were capped in normalization. Image discretization was performed using a bin width of 5 to simplify computation. Finally, 100 features from three categories describing the characteristics (intensity, shape and texture) of the volume of interest,e.g. tumour lesions, were extracted using pyradiomics .
2.0.2 Quantile Normalization
We observed that radiomic features extracted from medical images frequently included outliers, i.e.
samples with extreme feature values. However, in order to be able to reconstruct features from very few latent features in autoencoders, extreme values should be avoided. Experimentally we found that Z-normalization and capping did not work well for this problem. Hence, we used a quantile normalization approach with thresholds designed for radiomic features. We transform feature quantiles: 0-5% (min), 5-25%, 25%-50%, 50%-75%, 75%-95%, 95%-100% (max) to floats: 0, 1/6, 2/6, 3/6, 4/6, 5/6, 1, respectively. The quantile thresholds were empirically determined based on experiments.
2.0.3 Feature Representation Learning (with autoencoder)
With the transformed radiomic features, we leveraged an autoencoder to learn feature representations i.e. features in low dimensional latent space that best summarize the data. The input for the autoencoder was the 100-element radiomic feature vector previously mentioned (Fig. 2). There were five layers for both the encoder and the decoder. A relatively deep structure compared to the size of input was used to add non-linearity. The autoencoder was blinded to all clinical information to prevent overfitting. All layers were fully connected with SELU activation . We chose three as the number of latent features because there were three categories of radiomic features. The latent features were used as the input for subsequent unsupervised clustering.
2.0.4 Unsupervised Clustering (with GMM-MML)
Given the known extant of heterogeneity in cancer, imaging phenotypes of tumors with different genotypes/subtypes cannot be expected to be discrete. This motivated the choice of a probabilistic model instead of a deterministic one for tumor image clustering across known molecular and pathological categories. Therefore, we used a GMM for the unsupervised discovery of imaging subtypes.
Gaussian mixtures are weighted linear combinations of
-component Gaussians that are used to model a probability density. The optimum number of componentsand parameters
are estimated using minimum message length criterion (MML). The idea of MML is to simplify estimation of GMM parameters by minimization of encoding length. Consider a dataset which is generated from a probabilistic distribution , the message length required to encode and transmit is:
where is the prior and is the encoding message length.
Then and can be simultaneously estimated using the following equation:
where is the determinant of the expected Fisher information matrix .
3 Experiments and Results
3.1 Experimental Design
3.1.1 Data Description
The dataset we used is an institutional retrospective cohort of 108 patients with colorectal cancer liver metastases (CRCLM). The number of slices for each volume is between 57 and 170 (mean=100). Institutional Review Board approved the study and waived informed consent. Gadobutrol-enhanced liver MRIs, specifically 10-minute-delayed T1 MRIs were acquired after chemotherapy and prior to hepatic resection, with 1.5/3T MR systems. Tumor segmentation was performed by a single reader with 6 years of experience. The reader was blinded to all clinical information expect patient history of CRCLM. Patient mortality right-censored at three years were available.
3.1.2 Implementation Details
Radiomic feature extraction was implemented using packages pyradiomics (v2.0.0) , numpy (v1.14.3) and SimpleITK (v1.1.0) in Python (v2.7.15).
An autoencoder was trained with Keras on a NVIDIA TITAN Xp Graphics Card (with 12G memory). The loss function was set to binary cross entropy loss. We used adam optimization with a mini-batch size of 64. Default parameters were used, specifically initial learning rate of 0.001, decay rates of 0.9 and 0.999. The model was trained for 400 epochs.
GMM with MML was implemented using Python package gmm-mml (v0.11). Initial number of clusters was set to 25. Maximum iteration of 100 and convergence threshold of as used.
3.1.3 Evaluation Metrics
In radiomic subtype classification tasks, there is no “ground truth”. Thus, instead of using normalized mutual information and Rand index, which are typically used for evaluating the performances of unsupervised clustering methods, we evaluated the clinical value of our imaging clusters according to their association with patient outcome. Cox proportional hazard models were built for comparing different unsupervised clustering methods, in terms of concordance index (CI), hazard ratio (HR) and p-value. A cluster number of three was automatically selected by both GMM-MML and the heuristic algorithms provided in SIMLR (v1.8.1), so we compared the methods by the maximum pair-wise HR produced from their clustering results.
The resulting clusters from our pipeline and their clinical values were demonstrated using a kaplan-meier plot and results from a log-rank test. We also compared our approach with other unsupervised clustering methods that have been used to cluster radiomic features, including baseline consensus clustering[9, 13] and state-of-the-art SIMLR. Further, we compared our approach to validated clinical and imaging biomarkers for prognostic ability [5, 3].
|Method||Concordance index||Hazard ratio||P value|
Comparison of unsupervised learning algorithms for imaging phenotype clustering. Numbers in parentheses are confidence intervals.“*” denotes significant p value.
We estimated Gaussian mixtures of latent representations of radiomic features from liver MRIs using our pipeline. Three Gaussian distributions were learned, each representing an imaging subtype (Figure3.a-b). We compared the raw radiomic feature distributions of the three tumor subtypes. We found subtype II tumors to be larger but otherwise very similar to subtype I tumors. In contrast, tumors in subtype III were drastically different from the other two, mostly with higher heterogeneity in texture and were more elongated. There are 46, 41 and 21 patients in cluster I-III, respectively. Patients in subtype III had significantly worse survival rates than patients in the other two subtypes (Figure 3.c).
We compared our approach and other unsupervised clustering algorithms for defining imaging subtypes (Table 1). Our approach achieved the highest CI and the only statistically significant HR of 3.32 [1.35-8.18]. Therefore our approach was uniquely able to produce imaging subtypes predictive of patient survival. In addition, the estimated GMM components can be used to model theoretical distributions of CRCLM tumor appearance and be expanded to a validated training-testing model, while the subtypes defined by SIMLR and consensus clustering are hard to interpret and cannot be expanded to incorporate new samples.
We also compared the prognostic ability of our approach to other biomarkers for CRCLM (Table 2). The Fong score is a clinical risk score based on five independent preoperative risk factors built to predict patient prognosis . Target-tumor-enhancement (TTE) is an imaging biomarker specifically designed for stratifying CRCLM patients with late-gadolinium-enhanced MRI . The comparison was based on a subset of 99 patients who had all three biomarkers available. Our approach outperformed the Fong score and had comparable performance to TTE in predicting 3-year survival rates for CRCLM patients.
|Method||Hazard ratio||P value|
|Fong score ||2.28 [0.96-5.42]||0.060|
|Target tumor enhancement||4.06 [1.73-9.51]||0.001*|
4 Discussion and Conclusion
Probabilistic generative modeling for unsupervised clustering can alleviate the issues of overfitting and overlapping feature distributions in radiomics analysis. We proposed an unsupervised pipeline composed of an autoencoder and a GMM-MML for identifying imaging subtypes from radiomic features that achieved clinically meaningful results. Experiments showed that our approach outperformed other unsupervised algorithms typically used in radiomics by finding imaging subtypes that were associated with patient survival. We also demonstrated that our unsupervised model outperformed a clinical prognostic biomarker and had comparable performance to a state-of-the-art imaging biomarker designed for this specific tumor type and contrast agent.
Due to the modest sample size, we didn’t perform a train-test split for feature representation learning and Guassian mixture modeling. However, since the prediction of patient survival rate was independent of radiomic feature encoding and clustering, the survival difference between imaging subtypes were not due to overfitting in this respect. Also, our study was based on manual segmentations of tumor lesions. In future work, we will apply convolutional neural network structures which can extract features directly from MRI scans and remove the need of manual segmentations.
The underlying biological mechanisms of the identified imaging clusters can be elucidated further using pathway enrichment analysis [10, 13]. Investigations into targetable recurrent mutational or expressional changes in the subtypes may also open the way to guided personalized treatments for colorectal cancer liver metastases patients based on imaging analysis.
The authors would like to thank The Natural Sciences and Engineering Research Council of Canada (NSERC) for funding, and acknowledge the contribution of Drs. Karanicolas, Law and Coburn in helping to create the patient cohort for this study.
-  Cited by: §1.
-  (2014) Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications 5. External Links: Cited by: §1.
-  (2018) Late gadolinium enhancement of colorectal liver metastases post-chemotherapy is associated with tumour fibrosis and overall survival post-hepatectomy. European radiology, pp. 1–8. Cited by: §3.1.4, §3.2, Table 2.
-  (2002) Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis & Machine Intelligence (3), pp. 381–396. Cited by: §1, §2.0.4, §2.0.4.
-  (1999) Clinical score for predicting recurrence after hepatic resection for metastatic colorectal cancer: analysis of 1001 consecutive cases. Annals of surgery 230 (3), pp. 309. Cited by: §3.1.4, §3.2, Table 2.
-  (2017) Metabolic radiomics for pretreatment 18 f-fdg pet/ct to characterize locally advanced breast cancer: histopathologic characteristics, response to neoadjuvant chemotherapy, and prognosis. Scientific reports 7 (1), pp. 1556. Cited by: §1.
-  (2017) Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: §2.0.3.
-  (2018) Radiomic phenotypes of mammographic parenchymal complexity: toward augmenting breast density in breast cancer risk assessment. Radiology 290 (1), pp. 41–49. Cited by: §1.
-  (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning 52 (1-2), pp. 91–118. Cited by: §1, §3.1.4, Table 1.
-  (2019) Pathway enrichment analysis and visualization of omics data using g: profiler, gsea, cytoscape and enrichmentmap. Nature protocols, pp. 1. Cited by: §4.
-  (2017) Computational radiomics system to decode the radiographic phenotype. Cancer research 77 (21), pp. e104–e107. Cited by: §2.0.1, §3.1.2.
-  (2018) SIMLR: a tool for large-scale genomic analyses by multi-kernel learning. Proteomics 18 (2), pp. 1700232. Cited by: §3.1.3, §3.1.4, Table 1.
-  (2017) Unsupervised clustering of Quantitative image phenotypes reveals breast cancer subtypes with distinct prognoses and Molecular pathways. Clinical Cancer Research 23 (13), pp. 3334–3342. External Links: Cited by: §1, §1, §3.1.4, §4.