Colorectal cancer (CRC) is the fourth most common non-skin cancer and the second leading cause of cancer deaths in the United States . 50-70% of CRC patients eventually develop liver metastases, which becomes the predominant cause of death . Currently, hepatic resection is the only potential cure to colorectal cancer liver metastases (CRLM), providing up to 58% 5-year survival rate, compared to 11% in patients who had chemotherapy only [5, 4]. Stratifying CRLM patients can potentially lead to better selection of patients for hepatic resection and improvements in their prognosis.
Many biomarkers have been proposed to stratify CRLM patients. Main-stream clinical risk scores utilize clinical factors (number of metastases, lymph node status, Carcinoembryonic antigen level, etc.) to make predictions on outcomes and assist in treatment planning, but none are strong predictors of long-term survival . As medical images, specifically Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) images are routinely collected in the diagnosis and treatment planning of CRLM and contain complementary information to clinical variables, a variety of imaging signatures have also been developed. They are typically built on texture features extracted from the tumors, peritumoral parenchyma, and/or whole livers, and they have shown promising prognostic performance for CRLM patients[13, 2, 19]. Intensity features, such as tumor enhancement in delayed contrast-enhanced MRI, have also been reported to be correlated with survival after hepatic resection. Most medical-imaging-based prognostic models (not limited to CRLM) focus solely on the largest tumor lesion. Incorporating features from multiple lesions in predictive modeling is difficult because the mechanism of how each tumor contributes to multifocal patient outcome is unknown. However, as 90% of CRLM present with multifocal metastases, a stratification model that takes all lesions into consideration is highly desirable.
In CLRM and other multifocal tumours, studies have demonstrated that multifocality indeed provides additional prognostic information. Tumor burden score (TBS) is a recent prognostic marker for CLRM that is calculated from the number of tumors and the size of the largest lesion . TBS is shown to outperform maximum tumor size (as a biomarker) by a large margin, which indicates that multifocality may play an important role in the prognosis of CRLM patients. Huang et al. found that the secondary prostate tumors in multifocal prostate cancer can have higher Gleason scores than the largest tumors, indicating that the smaller tumors can be more aggressive and drive patient outcome. Follwell et al. investigated a list of clinical factors for brain metastases and found that cumulative tumour volume (i.e. summed volume of multifocal lesions) was the only factor significantly associated with survival. In this paper, we address multifocal cancer outcome prediction with a radiomics-based multiple instance learning framework.
Radiomics is an emerging field where mathematical summarizations of tumors are extracted from medical images for use in modeling. Radiomics approaches have shown promising results in various tasks and imaging modalities. However, for the analysis of multifocal cancers, radiomic models usually use features from the largest tumor, resulting in the loss of information 
. Recently, deep learning-based workflows have been developed to improve upon conventional radiomics, yet building a generalizable deep learning-based radiomics model requires huge datasets that are currently impractical in medical imaging.
Multiple instance learning (MIL) is a class of machine learning algorithms that learn from a bag of instances, where labels are available at the bag-level but not at instance level. MIL has been widely adopted in medical image and video analysis as it performs well in weakly-supervised situations . Another advantage of MIL is that it provides interpretability, such as the importance of each instance, when instance-based or attention-based approaches are used . In most cases, detection or prediction using medical images is modeled as a MIL problem where the whole image (or 3d scan) is a bag and image patches (or cubes) cropped from the image are instances. Our work differs from previous MIL studies in treating multifocal tumors as instances. This allows us to generate more accurate instance representations (as image patches can sometimes contain redundant or irrelevant information) as well as simplify the MIL problem by reducing the number of instances.
In this paper, we hypothesize that using features from all lesions of multifocal cancer will improve outcome prediction. To the best of our knowledge, our paper is the first that specifically addresses outcome prediction in multifocal cancer with medical imaging and machine learning. We propose an end-to-end autoencoder-based multiple instance neural network (AMINN) that predicts multifocal CRLM patient outcome based on radiomic features extracted from contrast-enhanced 3D MRIs. This model is jointly trained to reconstruct input radiomic features with an autoencoder branch and to make prediction with a multiple instance branch, which allows for learning prognosis-relevant representations of tumors. The multiple instance neural network then aggregates the representations of all tumors of a patient for survival prediction. In addition, we propose a feature transformation technique tailored to radiomic features to suit the need of deep learning models. Ablation experiments were performed to show the effect of all the components of our framework.
We propose an end-to-end deep learning framework for outcome prediction in multifocal CRLMs (Fig. 1
). Each individual lesion was manually segmented and radiomic features were extracted based on the segmentations. The network consists of two branches: an autoencoder for feature reduction and a multiple instance neural network for outcome prediction. The bottleneck layer is shared by the two branches and is taken by the multiple instance network as input.
We formulated the task of patient outcome prediction as a MIL problem as follows:
is the probability of event(3-year overall survival) for patient and is a bag of instances representing a varying number of tumors in patient . represents the encoding function which maps input features to latent representations . represents a transformation function parameterized by the neural network for extracting information from instance-level representations and is a permutation invariant multiple instance pooling function that aggregates instance representations to obtain bag representations and bag labels.
2.0.2 Radiomic feature extraction
We extract radiomic features from preoperative 10-minute delayed gadobutrol-enhanced 3D MRI scans. Tumors are delineated slice by slice by a radiologist with 6 years of experience in abdominal imaging. Resampling, normalization and discretization were applied to all scans for preprocessing. The open source radiomics package, pyradiomics, is used for feature extraction. For each tumor lesion, we extract 100 features from three categories to mathematically summarize its intensity, shape and texture properties, including 18 first-order features, 14 shape features, 22 gray-level co-occurrence matrix features, 16 gray-level run length matrix features, 16 gray-level size zone matrix features and 14 gray-level dependence matrix features.
2.0.3 Feature normalization
Normalization is a common technique in machine learning for both correcting biases in feature scales and speeding up training. Ideally, for deep learning frameworks, input data should be scaled, centered at zero and relatively symmetrically distributed around mean to alleviate oscillations and avoid vanishing (due to packed points) or exploding (due to extreme values) gradients in gradient descent 
. Z-score normalization, one of the most popular normalization techniques in radiomic analysis used to achieve this purpose, is less informative for data that is not approximately normally distributed. We observe that the distribution of most of our radiomic features are instead severely skewed. As a result, after Z-score normalization of tumor volume for example, above 90% of samples are packed in the range of [-0.5, 0], while the largest lesions have Z-value up to 8 (Figure 2). This is undesirable as tumor volume is correlated with many radiomic features and is an important predictor of patient outcome.
We utilize a simple two-step normalization algorithm that combines logarithmic transformation and Z-score normalization to tackle this problem:
where is the th input feature of the th sample. A logarithmic transformation is first applied to reduce skewness. To handle negative and zero values, we incorporate a feature-specific constant, , which accounts for the range of each individual feature and is more appropriate than applying a single constant (i.e. 0.01 or 1) to the transformations of all features. The first term makes sure that the value is larger than or equal to zero, while adding aims to avoid taking log operation on extremely small numbers by adding a constant based on the original scale of the . This is followed by a standard Z-score transformation to ensure all features have similar ranges and are centered around zero (Fig. 2).
2.0.4 Autoencoder structure
The autoencoder applied to reduce the dimensionality of input radiomic features is composed of a four-layer encoder and a four-layer decoder. All layers are fully-connected layers with ReLU activation except the last layer, which uses sigmoid activation. While autoencoders can be employed to select radiomic features by learning their low-dimensional representations in an unsupervised manner, we wish to directly extract the most relevant latent representation for patient prognosis. Thus the bottleneck layer of the autoencoder is connected to the multiple instance prediction network as input in order to select latent features that are informative of patient outcome.
2.0.5 Multiple instance neural network structure
The MIL network consists of 3 fully connected layers followed by a MIL pooling layer that integrates instance representations to predict patient label. Each fully connected layer consists of 32 hidden units. Various pooling methods are tested with the same backbone network[22, 25]:
The selection of MIL pooling functions usually depends on the specific task. Since it is not clear how multiple tumors collectively determine clinical outcome, apart from max, average and lse (log-sum-exp, a smooth approximation of max), we also test attention-based pooling (att), where an auxiliary network is applied to learn the weights of each instance. The att model can be more flexible and interpretable by incorporating a trainable pooling module that learns the importance of each instance .
3.0.1 Data Description
We used a retrospective cohort of 108 patients who had colorectal cancer liver metastases (CRLM) and were treated at our institution. Informed consent was waived by the institutional review board. The patients received Gadobutrol-enhanced magnetic resonance imaging before hepatic resection. Among the 108 patients who were eligible for hepatic resection, 50 of them had multifocal CRLM. There were 181 individual lesions, leading to an average of 3.6 lesions (range 2-17) per patient. The 10-minute delayed T1 sequence was used for this study. Tumor segmentation was performed by a radiologist with 6 years of experience. Patient overall survival time has median of 30 months and is right-censored to obtain 3-year survival. 19 out of 50 patients had an event during the censored period.
3.0.2 Implementation Details
Radiomic features were extracted using pyradiomics (v2.0.0) in Python (v3.6.5). The MRIs were resampled to [1.5, 1.5, 1.5] isotropic spacing using B-spline interpolation. Images were normalized using Z-score normalization with 99th percentile suppression then rescaled to range [0, 100] and discretized with a bin size of 5.
Keras was used to train the deep learning models on a NVIDIA TITAN Xp Graphics Card. We used the unifocal subset of our dataset as an independent cohort for hyper-parameter tuning. Grid searching was performed to determine the hyper-parameters for the proposed network, on number of epochs [40, 60, 80, 100], initial learning rate [0.001, 0.0001, 0.00001], number of hidden units [16, 32, 64] and dropout rate [0, 0.3, 0.5]. Over 10 repeated 3 cross-fold validations, the model with the best performance was trained with 32 hidden units in each fully-connected layer in 100 epochs without dropout. We apply these hyper-parameters directly in the training of AMINN on our multifocal CRLM cohort. The model was optimized using Adam optimization with an initial learning rate of 0.0001, beta_1 of 0.9 and beta_2 of 0.999, on a compound loss – mean squared error loss for reconstruction of the autoencoder plus binary cross-entropy loss from the bag-level prediction of the multiple instance neural network:
3.0.3 Evaluation strategy
We compared our approach with other machine learning methods that are commonly used for predicting patient survival from radiomic features in terms of area under the ROC curve (AUC) and accuracy (ACC) in 10 repeats of 3-fold cross validation, stratified by patient outcome. Our dataset is slightly imbalanced (19 positive outcomes vs 31 negative outcomes) so we choose AUC as the primary evaluation criteria. We built support vector machine (SVM), random forest (RF), logistic regression (LR) models based on features extracted from the largest lesion and we selected features using least absolute shrinkage and selection operator (LASSO) before fitting the models. We also built a unifocal version of our AMINN model for comparison by simply stacking the 4-layer encoder and the 3-layer fully-connected prediction network. For multiple instance prediction using AMINN, we tested four different settings.
We performed a series of ablation experiments to evaluate the effect of each component in our network. Specifically, we evaluated all the combinations of: using features from all lesions (multi) or features from the largest lesion; applying an autoencoder for feature reduction (ae) or a naive fully-connected multiple instance learning network; performing two-step normalization (log) or Z-score normalization only. In addition, we also evaluated the effect of our two-step normalization technique on the baseline traditional machine learning models. The ablation experiments were also carried out in 10 repeated 3-fold cross-validations.
For each fold in our 3-fold cross-validation, we predict the outcome for the test fold based on a AMINN model trained on the other two folds. We then combine the outputs of each fold to obtain the model predictions for all patients, which represents a continuous score for the probability of 3-year survival. We also obtain a binary risk score by median dichotimizing the continuous model outputs. We compared our binary score against other clinical and imaging biomarkers, including the Fong risk score, a clinical risk score that is applied in the clinic for treatment planning , the Tumor burden score, a recent clinical risk score developed to take number of CRLM lesions into prognosis consideration, and target tumor enhancement, a CRLM biomarker specifically designed for 10-minutes delayed gadobutrol-enhanced MRIs [7, 16, 3]. These methods are evaluated by calculating their concordance-index (c-index) with time to event and hazard ratio in univariate cox proportional hazard regression models. In addition, a multivariable cox model is used to test the significance and the added prognostic value of our risk score.
|AMINN*||Average||0.698 0.080||0.677 0.087|
0.05 in t-test) when compared with baseline methods
The proposed autoencoder-based multiple instance neural network (AMINN) outperforms baseline machine learning algorithms by a large margin (Table 1
). Among the four multiple instance pooling methods, average pooling has the best performance in terms of area under the ROC curve (AUC) and accuracy. Attention-based pooling has suboptimal performance (6.7% lower than average pooling in AUC), suggesting that more data are required for the proposed framework to learn the lethality (weights) of each lesion. Compared to commonly used approaches in conventional radiomics (i.e. machine learning algorithms paired with LASSO feature selection based on the largest tumor), AMINN with average pooling achieved a 19.5% increase in AUC (p0.001, t-test). The unifocal version of AMINN, which has the same structure as AMINN but only uses features from the largest tumor has similar performance as other baseline methods and is significantly worse than the proposed multiple instance framework.
We then tested the performance gain from each component of AMINN (Table 2). Incorporating all lesions into outcome prediction (denoted by multi) consistently improved AUC, by 5.2%, 6.5%, 13.8% and 13.5% compared to AMINN with baseline setting, with added two-step normalization, with added autoencoder, and added two-step normalization and autoencoder, respectively. Notably, adding two-step normalization also boosts performance by 7.7% to 10.7% from versions of AMINN using only Z-score normalization. In contrast, incorporating two-step normalization results in similar AUCs compared to Z-score normalization for baseline non-neural network algorithms (for conciseness, only results for logistic regression are shown). This supports the hypothesis that two-step normalization improves the training of neural networks with radiomic features as inputs. The autoencoder component decreases AUC in both versions of AMINN that use only the largest lesion but improves AUC when multiple lesions are considered. One possible explanation is that the autoencoder overfits quickly in the unifocal models.
When compared to clinical and imaging biomarkers for CLRM, our risk score is the only one that achieved statistical significance in univariate cox regression modeling for our cohort (Table 3
). It also achieved the highest concordance index (c-index) of 0.63 and a hazard ratio of 2.88 (95% confidence interval (CI): 1.12-7.74). In multivariable analysis where all four biomarkers were included as predictors, our risk score remains the only one showing significant predictive value in our cohort, while all other biomarkers were not independently informative. We also tested the clinical and imaging biomarkers on our entire cohort of 108 patients, which includes 58 unifocal and 50 multifocal cases. The performance of all models improved when applied to the entire cohort and both tumor burden score (TBS) and target tumor enhancement (TTE) become statistically significant for survival prediction. Although both TBS and the Fong score each have an inherent component that takes into account the number of lesions, they fail to further stratify only multifocal patients since all patients have elevated risk due to multiple tumours. Similarly, although TTE uses the intensity feature of the two largest lesions when there are multiple tumours, it has less predictive value when all patients have more than two tumors. The gap in predictive power of these biomarkers between predicting on the whole cohort and the multifocal subset highlights the need for developing multifocal cancer prediction models.
|Univariate analysis||Multivariate analysis|
|Method||HR (95% CI)||C-index||p-value||HR (95% CI)||p-value|
|Fong score ||1.44 (0.60-3.44)||0.55||0.41||1.19 (0.45-3.15)||0.47|
|TBS ||1.73 (0.77-3.88)||0.58||0.18||2.59 (0.96-7.01)||0.06|
|TTE ||2.29 (0.86-6.11)||0.59||0.10||1.72 (0.63-4.76)||0.29|
|AMINN||2.88 (1.12-7.44)||0.63||0.03||4.22 (1.38-12.88)||0.01|
The primary limitation of our study is its small sample size. Preoperative multifocal CRLM scans are difficult to acquire because most multifocal patients are not eligible for surgery and have poor survival . To the best of our knowledge, there is no public imaging dataset for outcome prediction in multifocal patients in any tumor type. One of our goals for this study is to draw more attention to multifocal diseases in the medical image computing community.
In our study, we used manual segmentations generated by a radiologist. To avoid inter-/intra-observer variability and save annotation time, it is ideal if automatic segmentation models, for instance, the nnUNet  can be used to perform tumor segmentation. However, although some deep learning-based approaches have achieved exciting performance in segmenting the liver, their performance for liver tumor segmentation in MRI is less satisfactory. Similarly, we did not apply whole image-based deep learning because the features learned can be biased if the network cannot detect or segment lesions accurately.
Currently, there are two predominant classes of radiomic analyses: 1. “conventional” or “hand-crafted” radiomics, where pre-defined quantitative features are extracted and traditional machine learning algorithms are used for feature selection and modeling; 2. “deep radiomics”, where end-to-end deep neural networks are used to extract features and make predictions. We used a slightly different approach, which takes advantage of the robustness of hand-crafted features in relatively small datasets and the strong learning capability of neural networks to select features and make classifications. Although it seems redundant because radiomic features and neural networks can both serve as feature extractors, our results empirically show that with appropriate transformation, hand-crafted features can be good inputs for neural networks when there is not enough data for training a deep learning model on images directly.
In conclusion, we proposed an end-to-end autoencoder-based multiple instance neural network (AMINN) to predict outcomes of multifocal CRLM patients from radiomic features extracted from contrast-enhanced MRIs. By incorporating information from all tumors for patient outcome prediction, our model achieved a significant 19.5% increase in AUC and 9.5% increase in accuracy compared to commonly used methods. The risk score built from the outputs of our network outperforms other clinical and imaging biomarkers and remains the only significant biomarker in a multivariable cox model. We also show the impact of our two-step normalization technique in improving the performance of our radiomic-feature-based neural network.
In future work, we will seek for external and larger datasets to validate our approach. We will also focus on interpreting the importance/weight of each lesion in multifocal patients based on the multiple instance learning framework, because this will shed light to the etiology of multifocal cancer and provide decision support for the management of borderline resectable multifocal patients. In addition, as we didn’t add CRLM-specific assumptions in our model, we believe this pipeline can be extended to other imaging modalities and other multifocal diseases, for example, bone, lung and brain metastases.
-  (2019) From handcrafted to deep-learning-based cancer radiomics: challenges and opportunities. IEEE Signal Processing Magazine 36 (4), pp. 132–160. Cited by: §1, §5.
Unsupervised clustering of quantitative imaging phenotypes using autoencoder and gaussian mixture model. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 575–582. Cited by: §1, §2.0.4.
-  (2018) Late gadolinium enhancement of colorectal liver metastases post-chemotherapy is associated with tumour fibrosis and overall survival post-hepatectomy. European radiology, pp. 1–8. Cited by: §1, §3.0.3, Table 3, Table 4.
-  (2004) Five-year survival after resection of hepatic metastases from colorectal cancer in patients screened by positron emission tomography with f-18 fluorodeoxyglucose (fdg-pet). Annals of surgery 240 (3), pp. 438. Cited by: §1.
-  (2011) Durable complete responses in metastatic colorectal cancer treated with chemotherapy alone. Clinical colorectal cancer 10 (3), pp. 178–182. Cited by: §1.
-  (2012) Volume specific response criteria for brain metastases following salvage stereotactic radiosurgery and associated predictors of response. Acta oncologica 51 (5), pp. 629–635. Cited by: §1.
-  (1999) Clinical score for predicting recurrence after hepatic resection for metastatic colorectal cancer: analysis of 1001 consecutive cases. Annals of surgery 230 (3), pp. 309. Cited by: §3.0.3, Table 3, Table 4.
-  (2016) Radiomics: images are more than pictures, they are data. Radiology 278 (2), pp. 563–577. Cited by: §1.
-  (2014) Re-evaluating the concept of “dominant/index tumor nodule” in multifocal prostate cancer. Virchows Archiv 464 (5), pp. 589–594. Cited by: §1.
-  (2018) Attention-based deep multiple instance learning. In International conference in machine learning, Cited by: §2.0.5, Table 1.
-  (2018) Nnu-net: self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486. Cited by: §5.
-  (1998) A framework for multiple-instance learning. In Advances in neural information processing systems, pp. 570–576. Cited by: §1.
-  (2020) MRI findings of liver parenchyma peripheral to colorectal liver metastasis: a potential predictor of long-term prognosis. Radiology, pp. 202367. Cited by: §1.
-  (2017) Multiple-instance learning for medical image and video analysis. IEEE reviews in biomedical engineering 10, pp. 213–234. Cited by: §1.
-  (2014) Performance of prognostic scores in predicting long-term outcome following resection of colorectal liver metastases. British Journal of Surgery 101 (7), pp. 856–866. Cited by: §1.
-  (2018) The tumor burden score: a new “metro-ticket” prognostic tool for colorectal liver metastases based on tumor size and number of tumors. Annals of surgery 267 (1), pp. 132–141. Cited by: §1, §3.0.3, Table 3, Table 4.
-  (2000) Reducing prediction error by transforming input data for neural networks. Journal of computing in civil engineering 14 (2), pp. 109–116. Cited by: §2.0.3.
-  (2019) Cancer statistics, 2019. CA: a cancer journal for clinicians 69 (1), pp. 7–34. Cited by: §1, §1.
-  (2017) Computed tomography image texture: a noninvasive prognostic marker of hepatic recurrence after hepatectomy for metastatic colorectal cancer. Annals of surgical oncology 24 (9), pp. 2482–2490. Cited by: §1.
-  (2017) Hepatic metastasis from colorectal cancer. Euroasian journal of hepato-gastroenterology 7 (2), pp. 166. Cited by: §1, §5.
-  (2017) Computational radiomics system to decode the radiographic phenotype. Cancer research 77 (21), pp. e104–e107. Cited by: §2.0.2.
-  (2018) Revisiting multiple instance neural networks. Pattern Recognition 74, pp. 15–24. Cited by: §2.0.5.
-  (2019) Vulnerabilities of radiomic signature development: the need for safeguards. Radiotherapy and Oncology 130, pp. 2–9. Cited by: §2.0.3.
-  (2016) Applications and limitations of radiomics. Physics in Medicine & Biology 61 (13), pp. R150. Cited by: §1.
-  (2019) Handbook of medical image computing and computer assisted intervention. Academic Press. Cited by: §1, §2.0.5, §2.0.5.