What does AI see? Deep segmentation networks discover biomarkers for lung cancer survival

03/26/2019 ∙ by Stephen Baek, et al. ∙ 14

Non-small-cell lung cancer (NSCLC) represents approximately 80-85% of lung cancer diagnoses and is the leading cause of cancer-related death worldwide. Recent studies indicate that image-based radiomics features from positron emission tomography-computed tomography (PET/CT) images have predictive power on NSCLC outcomes. To this end, easily calculated functional features such as the maximum and the mean of standard uptake value (SUV) and total lesion glycolysis (TLG) are most commonly used for NSCLC prognostication, but their prognostic value remains controversial. Meanwhile, convolutional neural networks (CNNs) are rapidly emerging as a new method for cancer image analysis, with significantly enhanced predictive power compared to hand-crafted radiomics features. Here we show that CNNs trained to perform the tumor segmentation task, with no information other than physician contours, identify a rich set of survival-related image features with remarkable prognostic value. In a retrospective study on 96 NSCLC patients before stereotactic-body radiotherapy (SBRT), we found that the CNN segmentation algorithm (U-Net) trained for tumor segmentation in PET/CT images contained features having strong correlation with 2- and 5-year overall and disease-specific survivals. The U-Net algorithm has not seen any clinical information (e.g. survival, age, smoking history) other than the images and the corresponding tumor contours provided by physicians. Furthermore, through visualization of the U-Net, we also found convincing evidence that the regions of progression appear to match the regions where the U-Net features identified patterns that predicted higher likelihood of death. We anticipate our findings will be a starting point for more sophisticated, non-intrusive, patient-specific cancer prognosis determination.





According to the World Health Organization (WHO), lung cancer remains the most common cancer and the leading cause of cancer-related death worldwide, with 2.1 million new cases diagnosed and 1.8 million deaths in 2018[2]. NSCLC accounts for 80-85% of lung cancer diagnoses[1], and the five-year survival rate of NSCLC remains relatively low (23%) compared to other leading cancer sites such as colorectal (64.5%), breast (89.6%), and prostate (98.2%)[19]. Historically, the tumor, nodes, and metastases (TNM) staging system has served as the major prognostic factor in predicting therapeutic outcomes, but it does not differentiate responders and non-responders in the same stage[20]. The maximum and the mean of standard uptake values (SUVmax and SUVmean) have been reported for their correlation with survival[6, 7, 8] but are of limited clinical value due to their unsatisfactory predictive power and lack of robustness[14, 15]. Other prognostic markers have also been studied, including TLG, which incorporates metabolic tumor volume (MTV) and metabolic activity (TLG = MTV × SUVmean). Reports[11, 12, 13] suggest that TLG may have better prognostic power than SUVmax or SUVmean. These metrics, however, are not optimal and do not provide a comprehensive image-based analysis of tumors[21]. More recently, radiomics approaches, which employ semi-automated analysis based on a few hand-crafted imaging features describing intratumoral heterogeneity, demonstrated higher prognostic power[22, 23]. However, these features still have limited predictive power, ranging between 0.5 and 0.79 in terms of the area under the curve (AUC)[23, 24, 25]. Recent literature in deep learning demonstrates its strong potential in cancer prognostication[16, 26]; however, the clinical implications of deep learning remain in question due to the limited interpretability of CNNs.
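As a concrete illustration of the conventional metrics discussed above, the following minimal sketch computes SUVmax, SUVmean, MTV, and TLG from an SUV volume and a binary tumor mask. The function name, the voxel volume, and the toy values are assumptions made for illustration and are not data from this study.

```python
import numpy as np

def conventional_pet_metrics(suv, mask, voxel_volume_ml):
    """Return SUVmax, SUVmean, MTV (mL), and TLG for one lesion."""
    tumor_suv = suv[mask]                       # SUV values inside the contour
    suv_max = float(tumor_suv.max())
    suv_mean = float(tumor_suv.mean())
    mtv = float(mask.sum()) * voxel_volume_ml   # metabolic tumor volume
    tlg = mtv * suv_mean                        # TLG = MTV x SUVmean
    return {"SUVmax": suv_max, "SUVmean": suv_mean, "MTV": mtv, "TLG": tlg}

# Toy 4x4x4 volume with a 2x2x2 "tumor" (hypothetical values).
suv = np.ones((4, 4, 4))
mask = np.zeros((4, 4, 4), dtype=bool)
mask[1:3, 1:3, 1:3] = True
suv[mask] = 5.0
suv[1, 1, 1] = 9.0

metrics = conventional_pet_metrics(suv, mask, voxel_volume_ml=0.5)
print(metrics)  # SUVmax = 9.0, SUVmean = 5.5, MTV = 4.0, TLG = 22.0
```

Each of these metrics collapses the whole lesion into a single number, which is one way to see why they cannot describe the spatial structure of the tumor.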

Here, we propose an interpretable and highly accurate framework to solve this problem by capitalizing on the unprecedented success of deep convolutional neural networks (CNNs). More specifically, we investigate the U-Net[18], a convolutional encoder-decoder network that has demonstrated exceptional performance in tumor detection and segmentation tasks. As illustrated in Fig. 1a, these networks take a three-dimensional (3D) volume image as an input, process it through a “bottleneck layer” where the image features are compressed, and reconstruct it into a binary segmentation map indicating a pixel-wise tumor classification result. Here, we focused on the information encoded at the bottleneck layer, which contains rich visual characteristics of the tumor, and reasoned that the encoded information at this layer might be relevant to the tumor malignancy and, thus, cancer survival. This is the central hypothesis of this paper.

Figure 1: Overview of the study. (a) Schematic diagram of the U-Net segmentation network. The U-Net is trained with PET/CT images and corresponding physician contours but without survival-related information. The “dimensional bottleneck” at the middle of the U-Net produces latent variables summarizing image features (55,296 features from CT + 55,296 features from PET), which we hypothesize to be relevant to cancer survival. These features are then used for survival prediction when a new patient arrives. (b) Summary statistics of the data set used in this study.

In a prior study[27, 28], we analyzed PET/CT images of 96 non-small cell lung cancer (NSCLC) patients that were obtained within 3 months prior to stereotactic body radiation therapy (SBRT), whose summary statistics are illustrated in Fig. 1b. For each volume image, a region of interest (ROI) with a dimension of 96 mm × 96 mm × 48 mm was set around each tumor location and the image was cropped to the ROI volume. Two separate U-Net models were trained to perform tumor segmentation in PET and CT images, respectively. Each of the models was supervised with the corresponding physician contours, but no other information such as survival time was provided. After training, each U-Net model learned to encode 55,296 features at the bottleneck layer for each patient, resulting in a total of 110,592 features per patient.
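The ROI cropping step can be sketched as follows. This is only an illustration under stated assumptions: the function `crop_roi`, the voxel spacing, and the zero-padding at the image border are hypothetical choices and are not taken from the study.

```python
import numpy as np

def crop_roi(volume, center, spacing_mm, roi_mm=(96.0, 96.0, 48.0)):
    """Crop a fixed physical-size box around `center` (voxel indices).

    Parts of the box that fall outside the volume are zero-padded so
    that the output shape is the same for every patient.
    """
    size = [int(round(r / s)) for r, s in zip(roi_mm, spacing_mm)]
    out = np.zeros(size, dtype=volume.dtype)
    src, dst = [], []
    for c, n, dim in zip(center, size, volume.shape):
        start = c - n // 2                      # desired start in the volume
        lo, hi = max(start, 0), min(start + n, dim)
        src.append(slice(lo, hi))               # region read from the volume
        dst.append(slice(lo - start, hi - start))  # where it lands in the ROI
    out[tuple(dst)] = volume[tuple(src)]
    return out

# Hypothetical scan with 2 mm x 2 mm x 4 mm voxels and a tumor near the border.
scan = np.random.default_rng(0).normal(size=(128, 128, 40))
roi = crop_roi(scan, center=(10, 64, 20), spacing_mm=(2.0, 2.0, 4.0))
print(roi.shape)  # (48, 48, 12)
```

Fixing the ROI in millimetres rather than voxels keeps the physical field of view constant across scans with different resolutions.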

These features are an intermediate output of the U-Net, which are then decoded to generate an automated segmentation in the network. Hence, it is likely that these features summarize some rich structural and functional geometry of the intratumoral and peritumoral area, some of which might be relevant to cancer survival. To test this proposition, we conducted a two-sample t-test and examined whether there were any statistically significant features from the U-Net that distinguish the survival and death groups. The t-test analysis was conducted for four different categories of survival separately, namely 2-year overall survival (2-yr. OS), 5-year overall survival (5-yr. OS), 2-year disease-specific survival (2-yr. DS), and 5-year disease-specific survival (5-yr. DS). The analysis revealed that there were on average 3,042 features in CT and 2,908 features in PET that had a p-value below 0.05, as illustrated in Fig. 2a. Moreover, the analysis revealed that there was a group of 299 features in CT and 292 features in PET that were commonly observed across the four survival categories, as illustrated in Fig. 2b, suggesting that there were indeed strong survival-related markers in the U-Net learned features.
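A minimal sketch of this kind of feature screening is given below, using `scipy.stats.ttest_ind` on synthetic data. The feature matrix, the outcome labels, and the planted feature are all made up for illustration; the study's actual data and test settings (for example, whether equal variances were assumed) are not specified here.

```python
import numpy as np
from scipy import stats

def significant_features(features, died, alpha=0.05):
    """Indices of features whose means differ between the two outcome groups."""
    result = stats.ttest_ind(features[died], features[~died], axis=0)
    return np.flatnonzero(result.pvalue < alpha), result.pvalue

rng = np.random.default_rng(0)
n_patients, n_features = 96, 1000           # far fewer features than 55,296
features = rng.normal(size=(n_patients, n_features))
died = rng.random(n_patients) < 0.5         # synthetic outcome labels
features[died, 7] += 3.0                    # plant one outcome-related feature

selected, pvalues = significant_features(features, died)
print(len(selected), 7 in selected)
```

Running the same screening once per survival category and intersecting the selected index sets would give the kind of common subset described above. With tens of thousands of features, a threshold of 0.05 is also expected to pass some features by chance alone, so a second selection step is a natural complement.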

Figure 2: Feature selection. (a) Log-scaled histogram of p-values for each survival category. Roughly 2,000 features in each category have a p-value less than 0.05. (b) Intersection between different survival categories. ‘NS’ indicates not significant features. Black dots are features that were significant (p < 0.05) in all four categories. There were a total of 139 and 292 such features in CT and PET, respectively, but only the 10 of them with the smallest p-values in all four categories were plotted. (c) Distribution of p-values among LASSO-selected features and across the others. Dashed line is the cutoff value (p = 0.05).

We therefore conducted a separate analysis to select features via the least absolute shrinkage and selection operator (LASSO). The analysis was divided into four independent experiments and, for each of the experiments, LASSO attempted to select a few features that have a strong relationship with one of the four survival categories using the linear logistic regression model. For more rigorous selection of features, the inclusion probability was computed via bootstrapping, instead of one-shot selection. In descending order of inclusion probability, we selected the top 20 features for each survival category, which resulted in a total of 73 features in CT and 56 features in PET across all categories, without double-counting the intersection. The top 20 features had an inclusion probability greater than 0.3 and they reported noticeably smaller p-values than the other features, as illustrated in Fig. 2c, which reconfirms the existence of strong survival-related radiomic markers in the U-Net learned features.
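The bootstrap inclusion probability can be illustrated with a small NumPy sketch. Everything below is an assumption for illustration: the L1-penalized logistic regression is solved with a simple proximal gradient (ISTA) loop rather than whatever solver the study used, and the penalty strength, cohort size, and synthetic data are hypothetical.

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=300):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w = w - lr * (X.T @ (p - y) / n)                         # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft threshold
        b -= lr * np.mean(p - y)                                 # unpenalized bias
    return w, b

def inclusion_probability(X, y, n_boot=50, seed=0, **kwargs):
    """Fraction of bootstrap resamples in which each feature gets a nonzero weight."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        w, _ = l1_logistic(X[idx], y[idx], **kwargs)
        counts += (w != 0)
    return counts / n_boot

# Synthetic cohort: the outcome depends only on features 3 and 10.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = (X[:, 3] + X[:, 10] > 0).astype(float)

prob = inclusion_probability(X, y, lam=0.1)
top = np.argsort(prob)[::-1][:5]     # features ranked by inclusion probability
print(top, prob[top])
```

Ranking by inclusion probability rather than by a single LASSO fit makes the selection less sensitive to which patients happen to be in the sample.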

Here, it is worth reemphasizing that the U-Net was trained without any survival-related information and, hence, it is highly unlikely that the U-Net-learned features were overfitted to the survival data or biased towards them. Yet, while these U-Net features were identified independently from survival data, they demonstrated strong evidence of correlation with, and hence prognostic power for, NSCLC survival, as discussed above. To quantify this, we performed linear logistic regression on each of the four survival categories and measured the accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve. The top 20 LASSO-selected features were used as independent variables in each category and no other covariates were provided. For rigorous validation, bootstrapping was employed to compute the 95% confidence interval (CI) of each performance measure. Each bootstrap sample was further split into a training set and a test set, and the results are reported in Fig. 3 as the average values across test sets in different bootstrap samples. The contrast in performance between the U-Net features and the conventional imaging features was clear, quantitatively demonstrating the strong prognostic power of the U-Net features.
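The validation procedure can be sketched roughly as follows. This is a simplified illustration on synthetic data: the plain gradient-descent logistic regression, the 70/30 split, the number of bootstrap samples, and the percentile-based interval are assumptions, not details confirmed by the study.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Unpenalized logistic regression fitted by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def auc(scores, y):
    """Probability that a random death case scores above a random survivor."""
    diff = scores[y == 1][:, None] - scores[y == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def bootstrap_performance(X, y, n_boot=100, test_frac=0.3, seed=0):
    """Mean and 95% percentile interval of accuracy, sensitivity, specificity, AUC."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        n_test = int(test_frac * len(idx))
        test, train = idx[:n_test], idx[n_test:]     # split the resample
        yt = y[test]
        if yt.min() == yt.max():                     # need both classes to score
            continue
        w, b = fit_logistic(X[train], y[train])
        score = X[test] @ w + b
        pred = (score > 0).astype(int)               # 1 = predicted death
        rows.append([np.mean(pred == yt),
                     np.mean(pred[yt == 1] == 1),    # sensitivity
                     np.mean(pred[yt == 0] == 0),    # specificity
                     auc(score, yt)])
    rows = np.array(rows)
    lo, hi = np.percentile(rows, [2.5, 97.5], axis=0)
    return rows.mean(axis=0), lo, hi

# Synthetic stand-in for 96 patients with 20 selected features.
rng = np.random.default_rng(1)
X = rng.normal(size=(96, 20))
y = (2.0 * X[:, :3].sum(axis=1) + rng.normal(size=96) > 0).astype(int)

mean, lo, hi = bootstrap_performance(X, y)
for name, m, l, h in zip(["accuracy", "sensitivity", "specificity", "AUC"], mean, lo, hi):
    print(f"{name}: {m:.2f} ({l:.2f}, {h:.2f})")
```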

Figure 3: Prognostic performance of the U-Net features. There are four survival categories being tested: 2-year overall survival (2-yr. OS), 5-year overall survival (5-yr. OS), 2-year disease-specific survival (2-yr. DS), and 5-year disease-specific survival (5-yr. DS). The U-Net features are compared with the conventional TLG marker and 17 radiomic features defined as in Oikonomou et al.[4]. Reported values are average performance scores across bootstrap samples and the numbers in parentheses are their 95% confidence bands. (a) Overall prediction accuracy (proportion of the correct prediction over the entire data set). (b) Sensitivity (correct prediction of death over all death cases). (c) Specificity (correct prediction of survival over all survival cases). (d) AUC of the receiver operating characteristics (ROC) curve.

Meanwhile, the LASSO-selected U-Net features were further studied to shed light on their clinical implications and intuitive meanings. We first tested the correlation between the LASSO-selected U-Net features and the conventional radiomic features to see if the U-Net features were capturing survival-related markers that were previously known to be effective. We found that features C16704 and P15398 had some correlation with TLG (correlation statistic = 0.25, 0.28 and p = 0.02, 0.01, respectively), where we name the U-Net features with a prefix ‘C’ or ‘P’ to indicate which imaging modality (CT or PET) they come from, followed by their indices. We also found that there were 16 features in CT and 18 features in PET that had noticeable correlation (p < 0.05) with the 17 conventional radiomic features defined as in Oikonomou et al.[4] This might indicate that the U-Net features capture some aspects of conventional radiomic features while having a substantially larger amount of additional information coded into them.

On the other hand, these U-Net features are essentially artificial neurons in deep neural networks. Hence, we may develop some additional insight into the U-Net features by visualizing which patterns the corresponding neurons are looking for. Intuitively, one can show many different three-dimensional image patterns to the U-Net encoder and observe which image pattern activates each neuron the most. To facilitate this process, we employed an optimization-based approach[29] where the objective is to maximize an individual neuron’s activation value by manipulating the input image pattern:

$$X^{*} = \arg\max_{X} f_i(X; W, b)$$

where $f_i$ is the $i$-th bottleneck feature of the U-Net encoder with the trained model parameters $W$ and $b$, and $X$ is the input image pattern. Displayed in Fig. 4a are different image patterns that activated the survival-related U-Net features. Many of the U-Net features appear to be capturing tumor-like blobs (e.g. C00048, C25988, P39051, P47258) or textural characteristics (e.g. C08680) in the image. Interestingly, some of the U-Net features, for example C01777 and C37399, were looking for tube-like structures near the tumor-like blobs, which might be capturing blood vessels and lymphatics in the peritumoral area. This is, indeed, consistent with the widely accepted clinical knowledge that tumors can show enhanced growth towards nearby vessels and lymphatics, as these carry nutrients that supply tumoral growth.
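To make the optimization concrete, here is a minimal sketch of activation maximization by gradient ascent. The two-layer `tanh` network is only a toy stand-in for the trained U-Net encoder, its weights are random, and the L2 decay term, step size, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(16, 64))   # toy "encoder" weights (hypothetical)
W2 = rng.normal(scale=0.3, size=(8, 16))

def encoder(x):
    """Toy stand-in for the trained encoder: 64 inputs -> 8 bottleneck features."""
    return W2 @ np.tanh(W1 @ x)

def activation_gradient(x, i):
    """Analytic gradient of feature i with respect to the input pattern x."""
    h = np.tanh(W1 @ x)
    return W1.T @ (W2[i] * (1.0 - h ** 2))

def maximize_activation(i, steps=200, lr=0.05, decay=0.01):
    """Gradient ascent on the input while the network weights stay fixed."""
    x = np.zeros(64)                         # start from a blank pattern
    for _ in range(steps):
        x += lr * (activation_gradient(x, i) - decay * x)   # ascent with L2 decay
    return x

x0 = np.zeros(64)
x_star = maximize_activation(i=3)
print(encoder(x0)[3], encoder(x_star)[3])    # activation before and after
```

In a real setting the gradient would come from automatic differentiation of the trained encoder, and the resulting pattern would be reshaped into a 3D volume for display.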

Moreover, we also visualized which regions in the patient images predicted low survival probability. We employed a guided gradient backpropagation approach[30]. The main idea of the guided backpropagation algorithm is to compute $\partial P / \partial x_{i,j,k}$, where $P$ is the probability of death and $x_{i,j,k}$ is the voxel value at position $(i, j, k)$ in the patient image. The gradient can be interpreted as the change in the death probability when the voxel changes to a different value. If the voxel was not significant in predicting death, the gradient value would be small, whereas if the voxel played an important role in predicting a high probability of death, the gradient value would be greater. Displayed in Fig. 4b are heatmaps representing the gradient. Heated regions (red) are the areas that lowered the probability of survival, whereas the other areas (blue) are the ones that had negligible effect on survival. In all cases, tumoral regions were highlighted in red, which might be trivial. However, through comparison with post-therapeutic images and clinical records of the patients, we observed that some of the heated regions outside of the tumoral volume overlap with the regions of progression (see Fig. 5), suggesting the potential of the visualization method for patient-tailored therapeutic planning in the future. However, this awaits a more rigorous and quantitative follow-up.
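The gradient of the death probability with respect to each voxel can be sketched on a toy model as shown below. The one-hidden-layer ReLU network, its random weights, and the 3×3×3 patch are hypothetical stand-ins for the U-Net encoder and survival model. The `guided` option applies the usual guided backpropagation rule of passing only positive gradients through active ReLU units.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(12, 27))   # hypothetical encoder weights
w2 = rng.normal(scale=0.5, size=12)         # hypothetical survival-model weights

def death_probability(x):
    """Toy model: voxels -> ReLU features -> logistic death probability."""
    h = np.maximum(W1 @ x, 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ h)))

def saliency(x, guided=False):
    """Gradient of the death probability with respect to every voxel."""
    z = W1 @ x
    p = death_probability(x)
    g_h = p * (1.0 - p) * w2            # dP/dh through the logistic output
    g_z = g_h * (z > 0)                 # ordinary backpropagation through ReLU
    if guided:
        g_z = g_z * (g_h > 0)           # guided rule: keep positive gradients only
    return (W1.T @ g_z).reshape(3, 3, 3)

patch = rng.normal(size=27)             # flattened 3x3x3 image patch
grad = saliency(patch)                  # plain gradient dP/dx
heat = saliency(patch, guided=True)     # guided-backpropagation heat map
print(heat.round(3))
```

Voxels with large positive values in `heat` are those for which an increase in intensity raises the predicted death probability in this toy model.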

Figure 4: Visualization of the U-Net features. (a) Image patterns that are captured by each survival-related U-Net feature. During training, CNNs essentially learn “templates” from training images and apply these templates to analyze and understand images. Displayed here are some of these templates that the U-Nets have captured for the segmentation task originally, but discovered in our study to be relevant to cancer survival. For example, C37399 appears to be a template looking for a tumor-like shape at the top-right corner and a tube-like structure at the bottom-left. In addition, C08680 appears to look for a textural feature of the tumor. (b) Regions that predicted death of the patients obtained via a guided backpropagation method[30]. Trivially, tumoral regions are highlighted in red in the heatmap. However, some of the heated regions outside of the tumoral volume matched with the actual locations of recurrences and metastases when they were compared with the post-therapeutic images and clinical records, rendering a great potential as a practical, clinical tool for patient-tailored treatment planning in the future.
Figure 5: Correlation between U-Net visualization and cancer progression. Post-therapeutic images were compared with the U-Net visualization results. We observed an agreement of the heated regions with the actual location of recurrence. (a) Axial slice from a primary (pre-therapeutic) CT image of a patient case IA001765. The slice is below the gross tumor volume so the tumor is not visible. (b) Corresponding U-Net visualization. There are two highlighted regions in the heat map. (c) Post-therapeutic CT image of the same patient. Dashed box indicates the estimated corresponding ROI to the primary CT slices. The heat map in (b) is superimposed on the ROI. Notice that the recurrence location coincides with the heated area.

In summary, we discovered that the U-Net segmentation algorithm trained for automated tumor segmentation on PET/CT was encoding rich structural and functional geometry at the bottleneck layer, such that the encoded features could be used for survival prediction in cancer patients even though the U-Net was trained without any survival-related information. The survival model based on such U-Net features demonstrated significantly higher predictive power than conventional PET-based metabolic burden metrics such as TLG or relatively recent hand-crafted radiomics approaches. The validity of this discovery was confirmed by several statistical tests. Furthermore, we visualized the survival-related U-Net features and observed that they were indeed depicting intratumoral and/or peritumoral structures that had been previously acknowledged as potentially relevant to cancer survival. Our approach awaits further validation against a larger number of observations and in a larger variety of cancer types. Also, there was not enough clinical evidence to conclude that the visualization of the U-Net features can identify potential regions of recurrence and metastasis and, thus, a follow-up study is suggested. Nevertheless, our findings may be a new starting point for quantitative image-based cancer prognosis, with a great deal of potentially important new knowledge yet to be discovered.


Research reported in this publication was supported by the National Cancer Institute (NCI) of the National Institutes of Health (NIH) under award number 1R21CA209874 and partially by U01CA140206 and P30CA086862.


  • [1] American Cancer Society. Non-small cell lung cancer. https://www.cancer.org/cancer/non-small-cell-lung-cancer/about/what-is-non-small-cell-lung-cancer.html. Accessed: 2019-02-27.
  • [2] World Health Organization. Cancer fact sheet. https://www.who.int/news-room/fact-sheets/detail/cancer (2018). Accessed: 2019-02-27.
  • [3] Satoh, Y., Onishi, H., Nambu, A. & Araki, T. Volume-based parameters measured by using FDG PET/CT in patients with stage I NSCLC treated with stereotactic body radiation therapy: Prognostic value. Radiology 270, 275–281, DOI: 10.1148/radiol.13130652 (2014).
  • [4] Oikonomou, A. et al. Radiomics analysis at PET/CT contributes to prognosis of recurrence and survival in lung cancer treated with stereotactic body radiotherapy. Scientific Reports 8, 4003, DOI: 10.1038/s41598-018-22357-y (2018).
  • [5] de Jong, E. E. et al. Applicability of a prognostic CT-based radiomic signature model trained on stage I-III non-small cell lung cancer in stage IV non-small cell lung cancer. Lung Cancer 124, 6–11, DOI: 10.1016/j.lungcan.2018.07.023 (2018).
  • [6] Berghmans, T. et al. Primary tumor standardized uptake value (SUVmax) measured on fluorodeoxyglucose positron emission tomography (FDG-PET) is of prognostic value for survival in non-small cell lung cancer (NSCLC): A systematic review and meta-analysis (MA) by the European Lung Cancer Working Party for the IASLC Lung Cancer Staging Project. Journal of Thoracic Oncology 3, 6–12, DOI: 10.1097/JTO.0b013e31815e6d6b (2008).
  • [7] Paesmans, M. et al. Primary tumor standardized uptake value measured on fluorodeoxyglucose positron emission tomography is of prognostic value for survival in non-small cell lung cancer: Update of a systematic review and meta-analysis by the European Lung Cancer Working Party for the International Association for the Study of Lung Cancer Staging Project. Journal of Thoracic Oncology 5, 612–619, DOI: 10.1097/JTO.0b013e3181d0a4f5 (2010).
  • [8] Bollineni, V. R., Widder, J., Pruim, J., Langendijk, J. A. & Wiegman, E. M. Residual 18F-FDG-PET uptake 12 weeks after stereotactic ablative radiotherapy for stage I non-small-cell lung cancer predicts local control. International Journal of Radiation Oncology*Biology*Physics 83, e551–e555, DOI: 10.1016/j.ijrobp.2012.01.012 (2012).
  • [9] Larson, S. M. et al. Tumor treatment response based on visual and quantitative changes in global tumor glycolysis using PET-FDG imaging: The visual response score and the change in total lesion glycolysis. Clinical Positron Imaging 2, 159–171, DOI: 10.1016/S1095-0397(99)00016-3 (1999).
  • [10] Liao, S. et al. Prognostic value of metabolic tumor burden on 18F-FDG PET in nonsurgical patients with non-small cell lung cancer. European Journal of Nuclear Medicine and Molecular Imaging 39, 27–38, DOI: 10.1007/s00259-011-1934-6 (2012).
  • [11] Chen, H. H., Chiu, N.-T., Su, W.-C., Guo, H.-R. & Lee, B.-F. Prognostic value of whole-body total lesion glycolysis at pretreatment FDG PET/CT in non-small cell lung cancer. Radiology 264, 559–566 (2012).
  • [12] Zaizen, Y. et al. Prognostic significance of total lesion glycolysis in patients with advanced non-small cell lung cancer receiving chemotherapy. European Journal of Radiology 81, 4179–4184, DOI: 10.1016/j.ejrad.2012.07.009 (2012).
  • [13] Mehta, G., Chander, A., Huang, C., Kelly, M. & Fielding, P. Feasibility study of FDG PET/CT-derived primary tumour glycolysis as a prognostic indicator of survival in patients with non-small-cell lung cancer. Clinical Radiology 69, 268–274, DOI: 10.1016/j.crad.2013.10.010 (2014).
  • [14] Burdick, M. J. et al. Maximum standardized uptake value from staging FDG-PET/CT does not predict treatment outcome for early-stage non–small-cell lung cancer treated with stereotactic body radiotherapy. International Journal of Radiation Oncology*Biology*Physics 78, 1033–1039, DOI: 10.1016/j.ijrobp.2009.09.081 (2010).
  • [15] Agarwal, M., Brahmanday, G., Bajaj, S. K., Ravikrishnan, K. P. & Wong, C.-Y. O. Revisiting the prognostic value of preoperative 18F-fluoro-2-deoxyglucose (18F-FDG) positron emission tomography (PET) in early-stage (I & II) non-small cell lung cancers (NSCLC). European Journal of Nuclear Medicine and Molecular Imaging 37, 691–698, DOI: 10.1007/s00259-009-1291-x (2010).
  • [16] Paul, R. et al. Deep feature transfer learning in combination with traditional features predicts survival among patients with lung adenocarcinoma. Tomography 2, 388–395, DOI: 10.18383/j.tom.2016.00211 (2016).
  • [17] Lao, J. et al. A deep learning-based radiomics model for prediction of survival in glioblastoma multiforme. Scientific Reports 7, Article No. 10353, DOI: 10.1038/s41598-017-10649-8 (2017).
  • [18] Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 234–241 (Springer, 2015).
  • [19] Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2018. CA: A Cancer Journal for Clinicians 68, 7–30, DOI: 10.3322/caac.21442 (2018).
  • [20] Woodard, G. A., Jones, K. D. & Jablons, D. M. Lung Cancer Staging and Prognosis, 47–75 (Springer International Publishing, Cham, 2016).
  • [21] Chicklore, S. et al. Quantifying tumour heterogeneity in 18F-FDG PET/CT imaging by texture analysis. European Journal of Nuclear Medicine and Molecular Imaging 40, 133–140, DOI: 10.1007/s00259-012-2247-0 (2013).
  • [22] Lee, G. et al. Radiomics and its emerging role in lung cancer research, imaging biomarkers and clinical management: State of the art. European Journal of Radiology 86, 297–307, DOI: 10.1016/j.ejrad.2016.09.005 (2017).
  • [23] Carvalho, S. et al. 18F-fluorodeoxyglucose positron-emission tomography (FDG-PET)-radiomics of metastatic lymph nodes and primary tumor in non-small cell lung cancer (NSCLC) – a prospective externally validated study. PLOS ONE 13, 1–16, DOI: 10.1371/journal.pone.0192859 (2018).
  • [24] Fried, D. V. et al. Stage III non–small cell lung cancer: Prognostic value of FDG PET quantitative imaging features combined with clinical prognostic factors. Radiology 278, 214–222, DOI: 10.1148/radiol.2015142920 (2016).
  • [25] Zhang, Y., Oikonomou, A., Wong, A., Haider, M. A. & Khalvati, F. Radiomics-based prognosis analysis for non-small cell lung cancer. Scientific Reports 7, Article number: 46349 (2017).
  • [26] Diamant, A., Chatterjee, A., Vallières, M., Shenouda, G. & Seuntjens, J. Deep learning in head & neck cancer outcome prediction. Scientific Reports 9, Article No. 2764, DOI: 10.1038/s41598-019-39206-1 (2019).
  • [27] Wu, X., Zhong, Z., Buatti, J. & Bai, J. Multi-scale segmentation using deep graph cuts: Robust lung tumor delineation in MVCBCT. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 514–518 (IEEE, 2018).
  • [28] Zhong, Z. et al. Simultaneous cosegmentation of tumors in PET-CT images using deep fully convolutional networks. Medical Physics 46, 619–633, DOI: 10.1002/mp.13331 (2019).
  • [29] Yosinski, J., Clune, J., Fuchs, T. & Lipson, H. Understanding neural networks through deep visualization. In ICML Workshop on Deep Learning (2015).
  • [30] Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618–626, DOI: 10.1109/ICCV.2017.74 (2017).