An AI-Augmented Lesion Detection Framework For Liver Metastases With Model Interpretability

07/17/2019 ∙ by Xin J. Hunt, et al. ∙ SAS ∙ Amsterdam UMC

Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths worldwide. Most CRC deaths are the result of progression of metastases. The assessment of metastases is done using the RECIST criterion, which is time consuming and subjective, as clinicians need to manually measure anatomical tumor sizes. AI has many successes in image object detection, but often suffers because the models used are not interpretable, leading to issues in trust and implementation in the clinical setting. We propose a framework for an AI-augmented system in which an interactive AI system assists clinicians in the metastasis assessment. We include model interpretability to give explanations of the reasoning of the underlying models.




1. Introduction

Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths worldwide (Bray et al., 2018). Most cancer deaths are the result of progression of metastases. Approximately 50% of CRC patients will develop metastases to the liver (colorectal liver metastases, CRLM) (Abdalla et al., 2006; Donadon et al., 2007). Patients with liver-only colorectal metastases can be treated with curative intent. Complete surgical resection of CRLM is considered the only method with a chance to cure these patients (Norén et al., 2016; de Ridder et al., 2016; Angelsen et al., 2017). Only 20% of the patients with CRLM present with resectable disease (Wicherts et al., 2007; d Eynde and Hendlisz, 2009; Poston et al., 2008; Adam et al., 2004). Initially unresectable liver metastases can become resectable after downsizing the lesions via systemic therapy. However, there is no consensus regarding the optimal systemic therapy regimen. The effect of systemic treatment varies between patients: some have a total response, while others show progression of disease (Adam et al., 2004; National Working Group on Gastrointestinal Tumors, 2004). Moreover, systemic therapy has many side effects due to its cytotoxicity (Meyerhardt and Mayer, 2005).

In clinical oncology, the selection and monitoring of treatment is crucial for effective cancer treatment and for the evaluation of new drug therapies. Accordingly, assessment of patient response to treatment is a crucial feature in the clinical evaluation of systemic therapy. The widely accepted and applied criterion for such assessment is the Response Evaluation Criteria In Solid Tumors (RECIST), which aims to measure the objective change of anatomical tumor size. The RECIST assessment is performed by measuring changes in one-dimensional (1-D) diameter in two target lesions before and after therapy (Eisenhauer et al., 2009). Though RECIST is a clinical standard worldwide, it is highly limited. Currently, it is not possible to predict clinical outcome based on tumor response assessment (RECIST) and patient characteristics in individual patients. A meta-analysis revealed that inter-observer variability in RECIST measurement may exceed the 20% cut-off for progression, resulting in potential misclassification of diagnosis (stable disease or progression) (Yoon et al., 2016). A further problem of RECIST is the subjectivity and variability in selecting target lesions.
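To make the threshold issue concrete, the following is a minimal sketch of a target-lesion response classifier. Only the 20% progression cut-off appears in the text above; the 30% decrease for partial response and the 5 mm absolute increase for progression are taken from the RECIST 1.1 guideline (Eisenhauer et al., 2009). The function name `recist_response` and its arguments are hypothetical, and the sketch deliberately ignores non-target lesions, new lesions, and the lymph node rules.

```python
from typing import Sequence


def recist_response(baseline_mm: Sequence[float],
                    nadir_sum_mm: float,
                    followup_mm: Sequence[float]) -> str:
    """Classify target-lesion response from longest diameters in millimetres.

    Simplified sketch: target lesions only, thresholds as in RECIST 1.1.
    """
    baseline_sum = sum(baseline_mm)
    if baseline_sum <= 0:
        raise ValueError("baseline must contain measurable lesions")
    followup_sum = sum(followup_mm)

    if followup_sum == 0:
        return "CR"  # complete response: all target lesions disappeared

    # Progression is judged against the smallest sum seen so far (the nadir).
    growth = followup_sum - nadir_sum_mm
    if nadir_sum_mm > 0 and growth / nadir_sum_mm >= 0.20 and growth >= 5.0:
        return "PD"  # progressive disease

    # Partial response is judged against the baseline sum.
    if (baseline_sum - followup_sum) / baseline_sum >= 0.30:
        return "PR"

    return "SD"  # stable disease


# Two readers measure the same two lesions at follow-up (baseline sum 50 mm).
reader_a = recist_response([30, 20], 50, [35, 24])  # sum 59 mm, +18%
reader_b = recist_response([30, 20], 50, [36, 25])  # sum 61 mm, +22%
```

In this illustrative case a 2 mm difference between readers moves the same patient from stable disease to progression, which is the kind of misclassification the meta-analysis describes.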

1.1. Assessment Automation

One of the goals of this project is to more accurately and efficiently assess tumor response. Automated medical image processing allows more objective analysis of clinically relevant imaging features. Machine learning methods like object detection models trained on clinical data can be used to detect tumor lesions and automate systemic therapy response assessment. However, fully automated systems directly based on object detection models are not ideal under current circumstances, for reasons including

  • Accuracy concerns: Modern general-purpose object detection methods can only achieve 30% to roughly 75% mean average precision (mAP), depending on the dataset they are tested on (Huang et al., 2017). Specialized models trained specifically for lesion detection may achieve better accuracy, but will still make mistakes. Incorporating professional knowledge and feedback from clinicians can significantly improve detection accuracy.

  • Lack of interpretability: Most high-performance object detection models are based on deep learning and are often considered black-box models. However, it is important to communicate the reason behind decisions to the physician prescribing treatment plans and to the patient. A system without the ability to explain its decisions is not desirable in the clinical setting.

Thus, in this paper we propose an interactive system, in which we use machine learning object detection models along with model interpretability and natural language generation to augment the efficacy and accuracy of clinicians in lesion detection and assessment. Model interpretability with natural language generation acts to provide trust and understanding of the underlying machine learning object detection models.

2. Related Work

2.1. Clinical studies

Artificial intelligence is quickly moving forward in many fields, including in medicine. Deep learning techniques have delivered impressive results in image recognition. Radiology and pathology are medical specialties that create and evaluate lots of medical images. Multiple studies in these specialties have used artificial intelligence to mimic or augment human capabilities.

Masood et al. (Masood et al., 2018) propose a computer-assisted decision support system with the potential to help radiologists improve detection and diagnosis decisions for pulmonary cancer stage classification. Wang et al. (Wang et al., 2016) show that using a deep learning model for automated detection of metastatic breast cancer in lymph node biopsies reduces the error rate of pathologists from over three percent to less than one percent. Hamm et al. (Hamm et al., 2019) use a convolutional neural network for fast liver tumor diagnosis on multi-phasic MRI images.

While these initial results are positive, substantial translation or implementation of these technologies into clinical use has not yet transpired (He et al., 2019). Beyond building AI algorithms, applying them in daily clinical practice is complex (Dreyer and Geis, 2017). Key challenges for the implementation include data sharing, patient safety, data standardization, integration into complex clinical workflows, compliance with regulation, and transparency (He et al., 2019). Transparency of complex AI algorithms is of great importance for clinicians. If a medical doctor cannot understand the outcome of an algorithm, then the doctor will be unable to explain the outcome to a patient. Technologies that help explain complex AI algorithms have an important role in acceptance of AI by the medical community.

2.2. Model interpretability

In our proposed framework there is a need for an applicable model interpretability method. The majority of methods for model interpretability come in three forms: inherently interpretable models (Angelino et al., 2017; Lou et al., 2013; Ustun and Rudin, 2016), methods for interpreting existing models (Montavon et al., 2018; Samek et al., 2017), and post-hoc investigations (Lundberg and Lee, 2017; Ribeiro et al., 2016, 2018; Strumbelj and Kononenko, 2010). In the current framework, we use a segmentation convolutional neural network model for lesion detection in CT images for the CRLM patients. To give explanations of this model, we use Shapley values, a post-hoc, model-agnostic interpretability method. The model-agnostic nature of the Shapley values gives us the flexibility to substitute better-performing models in future research.

Shapley values were originally introduced in game theory as a way to determine the individual contributions of players in a collaborative game. In model interpretability, Shapley values are used to measure the individual contributions of the input variable values of a single observation to a model's prediction. Recent work in this area includes Lundberg and Lee (2017) and Strumbelj and Kononenko (2010). Image models take a series of pixel values as input. Assigning Shapley values to these pixels creates a gradient over an image that indicates which regions contribute to or detract from lesion detection probabilities. This gradient is easy to read visually, which makes Shapley values a natural method for the clinician-collaboration framework.

3. Proposed Framework

In this section we propose a system with clinician interaction for high accuracy lesion detection and measurement. The proposed system is summarized in Figure 1.

The images from a CT scan are sent to an automated report generation system. The report generation system’s purpose is to create an interactive and coherent report highlighting possible lesions. This report will allow a clinician to quickly confirm or overwrite the detections made by the automated system. The report generation system consists of three modules: a lesion detection module, an interpretability module, and a natural language generation module. The lesion detection module processes the scan images and labels all potential lesions. Note that we can use any high-accuracy object detection model as the detection module. The detected lesions are sent to the interpretability module to generate visual explanations that highlight important decision areas, which can help the clinician make better informed decisions. The natural language generation module then collects information from previous modules and generates a report for the clinician.

Once a report is generated, a clinician then reviews the report, confirming or rejecting each individual detection. The clinician can also add new lesion detections missed by the automated system. During the review process, the clinician can interact with the interpretability module, review explanations for detected lesions, and request explanations for new areas as indicated by the clinician. Once all detections are confirmed or rejected, the scan image is sent to the automated measuring system, where information like lesion count, location, and diameter is recorded.
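The review loop described above can be sketched as a small data structure. This is an illustrative model of the workflow only, not the implementation behind the system: the names `Detection`, `Report`, `review`, `add_missed`, and `ready_for_measurement` are hypothetical, as is the choice to represent a detection by a pixel bounding box and a confidence score.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Detection:
    box: Box
    score: float                   # model confidence; 1.0 for clinician-added
    source: str = "model"          # "model" or "clinician"
    status: Status = Status.PENDING


@dataclass
class Report:
    detections: List[Detection] = field(default_factory=list)

    def review(self, index: int, accept: bool) -> None:
        """Record the clinician's decision on one automated detection."""
        self.detections[index].status = (
            Status.CONFIRMED if accept else Status.REJECTED
        )

    def add_missed(self, box: Box) -> None:
        """Add a lesion the automated system missed (confirmed on entry)."""
        self.detections.append(
            Detection(box, score=1.0, source="clinician",
                      status=Status.CONFIRMED)
        )

    def ready_for_measurement(self) -> bool:
        """True once no detection is still awaiting review."""
        return all(d.status is not Status.PENDING for d in self.detections)

    def confirmed(self) -> List[Detection]:
        """Detections that would be passed on to the measuring system."""
        return [d for d in self.detections if d.status is Status.CONFIRMED]


report = Report([Detection((10, 12, 40, 44), 0.91),
                 Detection((70, 80, 95, 99), 0.35)])
report.review(0, accept=True)          # clinician confirms the first box
report.review(1, accept=False)         # clinician rejects the second box
report.add_missed((120, 30, 150, 58))  # clinician adds a missed lesion
```

Keeping the status on each detection, rather than deleting rejected boxes, preserves a record of where the model and the clinician disagreed, which could later serve as feedback for improving the detection module.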

Figure 1. Flow chart of the proposed system

4. HyperSHAP (Hyperparameterized Shapley value estimation)

The interpretability module in our proposed framework uses Shapley values. We propose a novel and deterministic approximation to the Shapley values for efficient computation.

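The calculation of the Shapley value $\phi_j(x)$ for the $j$-th variable of an instance of interest, which we call a query $x$, given a predictive model $f$ is

$$\phi_j(x) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!} \Big( \mathbb{E}\big[f(\tau(x, X, S \cup \{j\}))\big] - \mathbb{E}\big[f(\tau(x, X, S))\big] \Big),$$

where the expectation is computed over all observations in a data set $X$ with $N$ total observations, $\mathbb{E}[f(\tau(x, X, S))] = \frac{1}{N} \sum_{i=1}^{N} f(\tau(x, X_i, S))$. The function $\tau$ and the relationship between $x$ and $X_i$ is

$$\tau(x, X_i, S) = z, \qquad z_k = \begin{cases} x_k, & k \in S, \\ X_{ik}, & k \notin S. \end{cases}$$

$S$ represents a subset of the $p$ variables used in the model training. The summation in the Shapley value computation is over all $S \subseteq \{1,\dots,p\} \setminus \{j\}$, that is to say, all subsets of variables that do not contain the variable of interest.

It is important to note that the Shapley value computation requires all subsets of variables, of which there are $2^p$. This computational complexity makes direct computation infeasible. Instead, we rely on a deterministic approximation that uses only allowed subsets of variables as determined by a hyperparameter $k$, leading to the name HyperSHAP. We use a deterministic approximation to ensure stability in explanations, while still achieving high accuracy of the Shapley values.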
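For a model with only a handful of variables, the Shapley values can still be computed exactly by enumerating every subset, which makes the cost of the full computation tangible. The sketch below uses the standard Shapley weighting $|S|!\,(p-|S|-1)!/p!$ and estimates each expectation by averaging over the rows of a data set. The helper names `tau`, `expected_value`, and `exact_shapley`, as well as the toy linear model, are assumptions made for illustration; this is the brute-force computation, not HyperSHAP.

```python
from itertools import combinations
from math import factorial


def tau(x, row, subset):
    """Hybrid point: query values on `subset`, data-row values elsewhere."""
    return [x[k] if k in subset else row[k] for k in range(len(x))]


def expected_value(f, x, data, subset):
    """Average model output over the data with `subset` fixed to the query."""
    return sum(f(tau(x, row, subset)) for row in data) / len(data)


def exact_shapley(f, x, data):
    """Exact Shapley values by enumerating all subsets (exponential in p)."""
    p = len(x)
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for combo in combinations(others, size):
                s = set(combo)
                gain = (expected_value(f, x, data, s | {j})
                        - expected_value(f, x, data, s))
                phi[j] += weight * gain
    return phi


def toy_model(v):
    """Toy linear model, used only to check the result by hand."""
    return 2 * v[0] + 3 * v[1] - v[2]


data = [[0, 0, 0], [2, 2, 2]]   # column means are all 1
query = [3, 1, 5]
phi = exact_shapley(toy_model, query, data)
```

For a linear model each Shapley value reduces to the coefficient times the distance of the query from the column mean, so `phi` here is approximately `[4, 0, -4]`, and the values sum to the difference between the prediction for the query and the average prediction over the data.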

Algorithm 2 describes the full HyperSHAP computation for a given value of the approximation depth $k$, using Algorithm 1, which computes the expectations.

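Algorithm 1 Shapley Expected Values
input: training data $X$, query data $x$, model function $f$, selection matrix $Z$
Let $e$ be an all-zero vector with one entry per row of $Z$
for each row index $i$ of $Z$ do
    initialize: $a \leftarrow 0$
    Let $z$ be the $i$-th row of $Z$
    for each row index $l$ of $X$ do
        Compute $f$ at the point that takes the query value $x_k$ where $z_k = 1$ and the value from the $l$-th row of $X$ otherwise, and add the result to $a$
    end for
    Compute $e_i = a / N$, where $N$ is the number of rows of $X$
end for

Algorithm 2 HyperSHAP
input: training data $X$, query data $x$, model function $f$, approximation depth $k$
for each subset size allowed by the depth $k$ do
    Let the rows of the selection matrix $Z$ for that size form the set of variable subsets of that size
end for
Use Algorithm 1 to compute the expected values $e$ with $Z$
Let $n$ be the number of rows in $Z$
for each variable $j$ do
    Let $z$ be the $j$-th column of $Z$
    Let $s$ be the row sum of $Z$ excluding the $j$-th column
    for $i = 1, \dots, n$ do
        Update $\phi_j$ with $e_i$, where the sign is determined by $z_i$ and the weight by $s_i$
    end for
end for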
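The expected-value step can be sketched in plain Python as follows. The function `shapley_expected_values` follows the structure of Algorithm 1: one averaged model output per row of a 0/1 selection matrix. The helper `selection_matrix`, which keeps every subset of at most `depth` variables, is purely an assumed rule for illustration; the exact set of subsets that HyperSHAP allows for a given depth is not reproduced here.

```python
from itertools import combinations


def selection_matrix(p, depth):
    """0/1 rows for all subsets of at most `depth` of p variables.

    Assumed selection rule, for illustration only.
    """
    rows = []
    for size in range(depth + 1):
        for combo in combinations(range(p), size):
            rows.append([1 if k in combo else 0 for k in range(p)])
    return rows


def shapley_expected_values(data, x, f, Z):
    """For each row z of Z, average f over the data with z's variables
    fixed to the query values and all other variables taken from the data."""
    e = [0.0] * len(Z)
    for i, z in enumerate(Z):
        total = 0.0
        for row in data:
            hybrid = [x[k] if z[k] else row[k] for k in range(len(x))]
            total += f(hybrid)
        e[i] = total / len(data)
    return e


Z = selection_matrix(2, 1)            # [[0, 0], [1, 0], [0, 1]]
e = shapley_expected_values([[0, 0], [2, 2]], [3, 1], sum, Z)

# Restricting the subsets shrinks the work considerably:
rows_restricted = len(selection_matrix(10, 2))   # 1 + 10 + 45 subsets
rows_full = 2 ** 10                              # all subsets of 10 variables
```

Because the rows of the selection matrix are enumerated in a fixed order rather than sampled, repeated runs on the same inputs return identical expected values, which is the stability property a deterministic approximation is meant to provide.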

5. Preliminary Results

5.1. Clinical Data

The first phase of this project aims to improve the response assessment to systemic therapy of CRLM patients by applying advanced analytics to medical imaging and clinical data. All patient data used were collected as part of the multicenter randomized clinical trial CAIRO5 (Huiskens et al., 2015). This ongoing study aims to downsize tumor burden in the liver and make local treatment with curative intent feasible for initially unresectable colorectal liver metastases, in order to improve (disease-free) survival. The data consisted of diagnostic imaging (CT images) before and after systemic therapy, evaluated by a nationwide expert panel. The CT images from 52 patients were used for segmentation of the liver and liver metastases by an expert radiologist. The expert segmentations were performed semi-automatically using the Philips® IntelliSpace Portal software. A total of 1380 liver metastases were segmented, resulting in the 3-D organ contours of the liver and all metastases. For each tumor, every three-dimensional pixel (voxel) is available for analytics.

5.2. Model Training and Interpretation

We built a deep-learning image segmentation model targeting labeled lesion regions on the CT images. For each new test CT image, we use the deep-learning model to predict lesion regions, and then calculate the Shapley values for the deep-learning model on the detected lesion regions in the image. Positive Shapley values indicate pixels that contribute positively towards the predicted probability of a lesion, while negative Shapley values indicate pixels that contribute negatively towards the model's predicted probability of a lesion. By viewing the area that contributes towards the predicted probability of a lesion, a clinician can see what area of the CT image the deep-learning model thinks is indicative of a lesion in the liver.
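As a small illustration of how the sign of the Shapley values could be turned into an overlay, the sketch below splits a 2-D grid of per-pixel values into pixels that support the lesion prediction and pixels that argue against it. The function name `split_attributions`, the `threshold` parameter, and the example grid are hypothetical and are not taken from the system itself.

```python
def split_attributions(shap_map, threshold=0.0):
    """Split a 2-D grid of per-pixel Shapley values by sign.

    Returns two lists of (row, column) coordinates: pixels whose value
    exceeds `threshold` (supporting a lesion) and pixels whose value is
    below `-threshold` (opposing a lesion). Values in between are ignored.
    """
    supporting, opposing = [], []
    for r, row in enumerate(shap_map):
        for c, value in enumerate(row):
            if value > threshold:
                supporting.append((r, c))
            elif value < -threshold:
                opposing.append((r, c))
    return supporting, opposing


# Hypothetical 2 x 2 patch of Shapley values.
patch = [[0.0, 0.2],
         [-0.1, 0.05]]
supporting, opposing = split_attributions(patch)
```

Raising `threshold` above zero would hide weak contributions in either direction, so that the overlay shown to the clinician highlights only the pixels that matter most to the prediction.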

In Figure 2 we show a sample report generated using the proposed framework.

Figure 2. A sample report. The clinician can confirm or reject each detected lesion by clicking the green or red button next to the bounding box. Clicking the orange button (labeled with letter “i”) will show explanations for the corresponding image patch for review. The clinician can also label new areas as lesions and request explanations.

6. Conclusion

Current lesion detection and measurement systems are not clinician-efficient, taking large amounts of clinicians’ time. These systems also suffer from significant inter-clinician variability. The proposed system optimizes the use of clinicians’ time by quickly identifying potential lesions and providing interpretation as to why the model provided such a prediction. By automating a portion of the lesion detection task, our framework can reduce inter-clinician variability.

Our initial experiments have shown promise both in the ability to detect lesions and the ability to explain the predictions of a lesion detection model. Future work includes improving the detection model, improving the interaction system with clinicians, and validating our AI-augmented framework through clinical trials.


  • Abdalla et al. (2006) Eddie K Abdalla, Rene Adam, Anton J Bilchik, Daniel Jaeck, Jean-Nicolas Vauthey, and David Mahvi. 2006. Improving resectability of hepatic colorectal metastases: expert consensus statement. Annals of surgical oncology 13, 10 (2006), 1271–1280.
  • Adam et al. (2004) René Adam, Valérie Delvart, Gérard Pascal, Adrian Valeanu, Denis Castaing, Daniel Azoulay, Sylvie Giacchetti, Bernard Paule, Francis Kunstlinger, Odile Ghémard, et al. 2004. Rescue surgery for unresectable colorectal liver metastases downstaged by chemotherapy: a model to predict long-term survival. Annals of surgery 240, 4 (2004), 644.
  • Angelino et al. (2017) Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. 2017. Certifiably optimal rule lists for categorical data. In Proceedings of the 23rd ACM SIGKDD Conference of Knowledge, Discovery, and Data Mining (KDD).
  • Angelsen et al. (2017) J-H Angelsen, A Horn, H Sorbye, GE Eide, IM Løes, and A Viste. 2017. Population-based study on resection rates and survival in patients with colorectal liver metastasis in Norway. British Journal of Surgery 104, 5 (2017), 580–589.
  • Bray et al. (2018) Freddie Bray, Jacques Ferlay, Isabelle Soerjomataram, Rebecca L Siegel, Lindsey A Torre, and Ahmedin Jemal. 2018. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 68, 6 (2018), 394–424.
  • d Eynde and Hendlisz (2009) MV d Eynde and Alain Hendlisz. 2009. Treatment of colorectal liver metastases: a review. Reviews on recent clinical trials 4, 1 (2009), 56–62.
  • de Ridder et al. (2016) Jannemarie AM de Ridder, Eric P van der Stok, Leonie J Mekenkamp, Bastiaan Wiering, Miriam Koopman, Cornelis JA Punt, Cornelis Verhoef, and H Johannes. 2016. Management of liver metastases in colorectal cancer patients: a retrospective case-control study of systemic therapy versus liver resection. European Journal of Cancer 59 (2016), 13–21.
  • Donadon et al. (2007) Matteo Donadon, Dario Ribero, Gareth Morris-Stiff, Eddie K Abdalla, and Jean-Nicolas Vauthey. 2007. New paradigm in the management of liver-only metastases from colorectal cancer. Gastrointestinal cancer research: GCR 1, 1 (2007), 20.
  • Dreyer and Geis (2017) Keith J Dreyer and J Raymond Geis. 2017. When machines think: radiology’s next frontier. Radiology 285, 3 (2017), 713–718.
  • Eisenhauer et al. (2009) Elizabeth A Eisenhauer, Patrick Therasse, Jan Bogaerts, Lawrence H Schwartz, D Sargent, Robert Ford, Janet Dancey, S Arbuck, Steve Gwyther, Margaret Mooney, et al. 2009. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European journal of cancer 45, 2 (2009), 228–247.
  • Hamm et al. (2019) Charlie A Hamm, Clinton J Wang, Lynn J Savic, Marc Ferrante, Isabel Schobert, Todd Schlachter, MingDe Lin, James S Duncan, Jeffrey C Weinreb, Julius Chapiro, et al. 2019. Deep learning for liver tumor diagnosis part I: development of a convolutional neural network classifier for multi-phasic MRI. European radiology (2019), 1–10.
  • He et al. (2019) Jianxing He, Sally L Baxter, Jie Xu, Jiming Xu, Xingtao Zhou, and Kang Zhang. 2019. The practical implementation of artificial intelligence technologies in medicine. Nature medicine 25, 1 (2019), 30.
  • Huang et al. (2017) Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7310–7311.
  • Huiskens et al. (2015) Joost Huiskens, Thomas M van Gulik, Krijn P van Lienden, Marc RW Engelbrecht, Gerrit A Meijer, Nicole CT van Grieken, Jonne Schriek, Astrid Keijser, Linda Mol, I Quintus Molenaar, et al. 2015. Treatment strategies in colorectal cancer patients with initially unresectable liver-only metastases, a study protocol of the randomised phase 3 CAIRO5 study of the Dutch Colorectal Cancer Group (DCCG). BMC cancer 15, 1 (2015), 365.
  • Lou et al. (2013) Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. 2013. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 623–631.
  • Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
  • Masood et al. (2018) Anum Masood, Bin Sheng, Ping Li, Xuhong Hou, Xiaoer Wei, Jing Qin, and Dagan Feng. 2018. Computer-assisted decision support system in pulmonary cancer detection and stage classification on CT images. Journal of biomedical informatics 79 (2018), 117–128.
  • Meyerhardt and Mayer (2005) Jeffrey A Meyerhardt and Robert J Mayer. 2005. Systemic therapy for colorectal cancer. New England Journal of Medicine 352, 5 (2005), 476–487.
  • Montavon et al. (2018) Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2018. Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73 (2018), 1–15.
  • Norén et al. (2016) Agneta Norén, HG Eriksson, and LI Olsson. 2016. Selection for surgery and survival of synchronous colorectal liver metastases; a nationwide study. European journal of cancer 53 (2016), 105–114.
  • National Working Group on Gastrointestinal Tumors (2004) National Working Group on Gastrointestinal Tumors. 2004. Colorectal carcinoma national guideline 2014.
  • Poston et al. (2008) Graeme J Poston, Joan Figueras, Felice Giuliante, Gennaro Nuzzo, Alberto F Sobrero, Jean-Francois Gigot, Bernard Nordlinger, Rene Adam, Thomas Gruenberger, Michael A Choti, et al. 2008. Urgent need for a new staging system in advanced colorectal cancer. Journal of Clinical Oncology 26, 29 (2008), 4828–4833.
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 1135–1144.
  • Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence.
  • Samek et al. (2017) Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296 (2017).
  • Strumbelj and Kononenko (2010) Erik Strumbelj and Igor Kononenko. 2010. An Efficient Explanation of Individual Classifications Using Game Theory. J. Mach. Learn. Res. 11 (March 2010), 1–18.
  • Ustun and Rudin (2016) Berk Ustun and Cynthia Rudin. 2016. Learning Optimized Risk Scores on Large-Scale Datasets. arXiv preprint arXiv:1610.00168 (2016).
  • Wang et al. (2016) Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, and Andrew H Beck. 2016. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718 (2016).
  • Wicherts et al. (2007) Dennis A Wicherts, Robbert J de Haas, and René Adam. 2007. Bringing unresectable liver disease to resection with curative intent. European Journal of Surgical Oncology (EJSO) 33 (2007), S42–S51.
  • Yoon et al. (2016) Soon Ho Yoon, Kyung Won Kim, Jin Mo Goo, Dong-Wan Kim, and Seokyung Hahn. 2016. Observer variability in RECIST-based tumour burden measurements: a meta-analysis. European journal of cancer 53 (2016), 5–15.