Lessons Learned from Designing an AI-Enabled Diagnosis Tool for Pathologists

Hongyan Gu, et al. · 06/23/2020

Despite the promises of data-driven artificial intelligence (AI), little is known about how we can bridge the gulf between traditional physician-driven diagnosis and a plausible future of medicine automated by AI. Specifically, how can we involve AI usefully in physicians' diagnosis workflow given that most AI is still nascent and error-prone (e.g., in digital pathology)? To explore this question, we first propose a series of collaborative techniques to engage human pathologists with AI given AI's capabilities and limitations, based on which we prototype Impetus - a tool where an AI takes various degrees of initiatives to provide various forms of assistance to a pathologist in detecting tumors from histological slides. Finally, we summarize observations and lessons learned from a study with eight pathologists and discuss recommendations for future work on human-centered medical AI systems.


1. Introduction

The recent development of data-driven artificial intelligence (AI) is rejuvenating the use of AI in medicine, which originally started over half a century ago. Enabled by data-driven statistical models, AI can already read X-Ray images (Irvin et al., 2019b; Cao et al., 2019) and analyze histological slides (Nalisnik et al., 2017; Campanella et al., 2019) with a performance on par with human experts.

Despite their promises of automating diagnosis, existing medical AI models tend to be 'imperfect' (Cai et al., 2019)—there remain inherent limitations in the models' performance and generalizability. For example, in digital pathology, scanned tissue slides are processed by AI to detect tumor cells. The problem is that such histological data (e.g., ovarian carcinoma) tends to have a high between-patient variance (Köbel et al., 2010); thus, a pre-trained model often struggles to generalize when deployed to a new set of patients. At present, it remains underexplored how to integrate such 'imperfect' AI usefully into physicians' existing workflow.

Researchers have long realized the limitation of using AI as a 'Greek Oracle'. Miller and Masarie pointed out that a "mixed initiative system" is mandatory whereby "physician-user and the consultant program should interact symbiotically" (Miller and Masarie, 1990). Some research focused on mimicking how doctors think (Shortliffe, 1993; Drew et al., 2013), such as using an attention-guided approach to extract local regions of interest on a thoracic X-ray image (Guan et al., 2018); others developed explainable models (Caruana et al., 2015; Zhang et al., 2017) or system designs (Xie et al., 2020; Wang et al., 2019) that promote a doctor's awareness of AI's diagnosis process. Cai et al. developed a content-based image retrieval (CBIR) tool that allows a pathologist to search for similar cases by region, example, or concept (Cai et al., 2019). Yang et al. conducted field work to identify when and how AI can fit in the decision-making process of vascular assist device transplants (Yang et al., 2016, 2019). Despite such a growing body of work on mental models, explainability, CBIR tools, and field studies, little of it can answer the following question for medical AI: when AI is still nascent and error-prone, how can physicians still make use of such 'imperfect' AI in their existing diagnosis workflow?

To ground the exploration of this question, we focus on medical imaging—one of the primary data sources in medicine (Jiang et al., 2017). Amongst various medical imaging techniques, histological data in digital pathology (Holzinger et al., 2017), in particular, presents some of the most difficult challenges for achieving AI-automated diagnosis, thus serving as an ideal arena to explore the interactional relationship between physicians and AI.

Focusing on digital pathology as a point of departure, we propose a series of physician-AI collaboration techniques, based on which we prototype Impetus—a tool where an AI aids a pathologist in detecting tumors on histological slides with multiple degrees of initiative. Trained on a limited-sized dataset, our AI model cannot fully automate the examination process; instead, Impetus harnesses AI to guide pathologists' attention to regions of major outliers, thus helping them prioritize the manual examination process; to use agile labeling to train and adapt itself on-the-fly by learning from pathologists; and to take initiatives appropriate to its level of performance confidence, from automation, to pre-filling a diagnosis, to defaulting back to manual examination. We used the Impetus prototype as a medium to engage pathologists, observe how they perform diagnosis with AI involved in the process, and elicit their qualitative reactions and feedback on the aforementioned collaborative techniques. From work sessions with eight pathologists from a local medical center, we summarize the lessons learned as follows.

Lesson #1: To explain AI's guidance, suggestions, and recommendations, the system should go beyond a one-size-fits-all concept and provide instance-specific details that allow a medical user to see the evidence that leads to a recommendation.

Lesson #2: Medical diagnosis is seldom a one-shot task; thus AI's recommendations need to continuously direct a medical user to filter and prioritize a large task space, taking into account new information extracted from the user's up-to-date input.

Lesson #3: Medical tasks are often time-critical; thus the benefits of AI's guidance, suggestions, and recommendations need to be weighed against the amount of extra effort incurred and the actionability of the provided information.

Lesson #4: To guide the examination process with prioritization, AI should help a medical user narrow in on small regions of a large task space, as well as help them filter out information within specific regions.

Lesson #5: It is possible for medical users to provide labels during their workflow with acceptable extra effort. However, the system should provide explicit feedback on how the model improves as a result, as a way to motivate and guide medical users' future inputs.

Lesson #6: Tasks treated equally by an AI might carry different weights for a medical user. Thus, for medically high-stakes tasks, AI should provide information to validate its confidence level.

Importantly, these lessons reveal what was unexpected as pathologists collaborated with AI using Impetus’ techniques, which we further discuss as design recommendations for the future development of human-centered AI for medical imaging.

1.1. Contributions

Our contributions are as follows.


  • The first suite of interaction techniques in medical diagnosis that instantiate mixed-initiative principles (Horvitz, 1999) for physicians to interact with AI with adaptive degrees of initiative based on the AI's capabilities and limitations;

  • A proof-of-concept system that embodies these techniques as an integrated diagnosis tool for pathologists to detect tumors from histological slides;

  • A summary of observations and lessons learned from a study with eight pathologists that provides empirical evidence of employing mixed-initiative interaction in the medical imaging domain, thus informing future work on the design and development of human-centered AI systems.

2. Related Work

Our review of the literature starts from a general body of cross-disciplinary work on human-AI interaction, gradually drills down to the (medical) imaging domain, and finally summarizes the current status of digital pathology, which exemplifies the gap between traditional manual diagnosis and not-yet-available AI-enabled automation.

2.1. Human-AI Interaction

Since J. C. R. Licklider’s vision of ‘man-machine symbiosis’ (Licklider, 1960), bringing human and AI to collaboratively work together has been a long-standing challenge across multiple fields.

In particular, machine learning and data science could leverage human involvement to overcome problems challenging for existing computational methods or systems. Forté enables designers to interactively express and refine sketch-based 2D designs automated by topology optimization: specifically, a user can modify the optimization's result, which serves as input for the next iteration to reflect the user's intent (Chen et al., 2018).

Amershi et al. propose a system that gives the user flexibility to provide better training examples in interactive concept learning (Amershi et al., 2009). The system also allows users to control the learning process and helps them decide when to stop training to avoid overfitting. Chau et al. combine visualization, user interaction, and machine learning to guide users to explore correlations and to understand the structure of large-scale network data (Chau et al., 2011). Suh et al. show that classifier training with mixed-initiative teaching is advantageous over both computer-initiated and human-initiated counterparts. Specifically, mixed-initiative training could significantly reduce labeling complexity across a broad spectrum of scenarios, from perfect, helpful teachers who always provide the most helpful teaching, to naive teachers who give completely unhelpful labels (Suh et al., 2016). Felix et al. propose a topic modeling system that can find unknown labels for a group of documents: by integrating human-driven label refinement and machine-driven label recommendations, the system enables analysts to explore, discover, and formulate labels (Felix et al., 2018).

Research has also shown that human-AI interaction can enhance domain-specific tasks. For example, Nguyen et al. combine human knowledge, automated information retrieval, and ML, enabling a mixed-initiative approach to fact-checking (Nguyen et al., 2018).

To democratize the design of human-AI interaction, Horvitz articulated a series of principles of mixed-initiative interaction via the example of an email-to-calendar scheduling agent (Horvitz, 1999). Insights from Horvitz's work were renewed in a recent paper by Amershi et al., which proposes and validates 18 guidelines for human-AI interaction, including, for instance, "support efficient correction", "make clear why the system did what it did", and "convey the consequences of user actions" (Amershi et al., 2019). Below we delve into the digital image processing domain to review prior work to which human-AI interaction can contribute.

2.2. Data-Driven Digital Image Processing

Imaging provides an abundant source of clinical data in medicine (Jiang et al., 2017). While data-driven AI has served as a powerful toolkit for processing digital images, human involvement remains an indispensable part, primarily manifested in the provision of training labels.

Ilastik (Sommer et al., 2011) enables users to draw strokes over images for training segmentation models. The system can automatically recommend the most important features to reduce overfitting. However, the microscopic nature of such labels demands a lot of user effort to achieve whole-slide level performance. HistomicsML is an active learning (Settles, 2009) system that dynamically queries the most uncertain patches from a random forest classifier, thus allowing pathologists to iteratively refine the classification model with fewer samples. Instead of selecting only the most uncertain samples, Zhu et al. also consider the samples that would contribute most to 'shrinking' the hypothesis space. Their method also takes the structural hierarchy of digital histological images into account: the queried samples are spread across different tissue partitions in the most diverse manner (Zhu et al., 2014). As indispensable as it is, human input is often unavailable due to the accumulated effort of labeling a large amount of data.

To lower the requirement of human effort, one approach is to rely on existing whole-slide-level labels, thus dispensing with the need for users to label at the pixel level. Xu et al. employ a multiple instance learning (MIL) approach (Xu et al., 2012) to train a patch-level classifier based on slide-level annotations. However, it often requires a large amount of slide-level annotation for training; otherwise, there is often a performance drop compared to using strongly-supervised labels on the same set of slides. Ilse et al. use an attention-based deep multiple instance learning method to learn a Convolutional Neural Network (CNN) with image-level annotations. The learned CNN can highlight the areas that contribute the most to the image-level classification results. This approach can be applied to breast or colon cancer detection, with reported performance higher than other MIL approaches without sacrificing interpretability (Ilse et al., 2018). Campanella et al. combine MIL and Recurrent Neural Network models for slide-level diagnosis of prostate cancer, basal cell carcinoma, and breast cancer metastases. The model was trained on more than 44,000 slides from more than 15,000 patients and achieved an AUC of 0.96 in all three types of diseases (Campanella et al., 2019).

2.3. Interactive Tools for Digital Pathology

Digital pathology, similar to biology research, often deals with high-resolution, visually challenging images. Beyond involving AI trained by domain experts, tools that allow pathologists to define, explore, and decide upon clinical or research problems are also needed.

ImageJ (Schneider et al., 2012) is a scientific image analysis tool that has been widely used and extended by computational medicine research. The platform provides basic image processing tools, such as image enhancement and color processing, and allows users to perform various image operations, such as cropping, scaling, measuring, and editing. Based on ImageJ, various distributions (Schindelin et al., 2012; Rueden et al., 2017; Collins, 2007) and plugins (Jensen, 2013) have been increasingly integrated into the system, making it the most widely used software in digital pathology (Jensen, 2013).

A variety of tools is designed primarily for digital pathology research. For example, CellProfiler (Carpenter et al., 2006) is a cell analysis tool that helps non-computational users perform standard assays as well as complex morphological assays quantitatively and automatically, without writing any code. caMicroscope (Saltz et al., 2017) enables a user to run a segmentation pipeline in a selected area and to quantitatively analyze nuclei characteristics (e.g., shape, size, intensity, and texture features) in whole-slide images (WSIs). QuPath (Bankhead et al., 2017) provides extensive annotation and visualization tools for pathologists to rapidly brush over lesion tissues. It also provides a set of ready-to-use analysis algorithms to construct customized workflows for powerful batch processing, with additional flexibility for developers to add extensions and applications. Pathology Image Informatics Platform (PIIP) (Martel et al., 2017) extends the capabilities of the Sedeen viewer (https://pathcore.com/sedeen/) by adding plugins for out-of-focus detection, region of interest transformation, and immunohistochemical (IHC) slide analysis.

Recent research (Yang et al., 2016, 2019; Xie et al., 2019, 2020) suggests that, besides reasoning with medical data, the design of a diagnosis tool often needs to take into consideration physicians' established workflow and other domain-specific behaviors. Some digital pathology tools further address the collaborative nature of performing large-scale, image-based studies. Cytomine (Marée et al., 2016), for example, provides web-based support for organizing, exploring, and analyzing multi-gigapixel image data among team members. Others focus on data curation and management, e.g., Digital Slide Archive (Gutman et al., 2017), which enables users to build their own web-based digital pathology archives by integrating massive image databases with clinical annotations and genomic metadata.

3. Impetus: an AI-Enabled Tool for Pathologists

Before we unfold our design process in the next section, we first introduce the background of digital pathology and the motivation to involve AI. We then walk through Impetus’s scenario to present a high-level overview of how the tool works with pathologists.

3.1. Background of Digital Pathology

Central to digital pathology are whole-slide images, or WSIs for short. WSIs are produced by high-speed slide scanners that digitize glass slides at very high resolution, resulting in gigapixel images (Litjens et al., 2018). Due to its large size and high visual variance, a WSI cannot be directly fed into a model such as a Convolutional Neural Network (CNN) classifier. A WSI is usually divided into small patches, which are then classified by a CNN model. These patch-level predictions can then be assembled into a tumor probability heatmap, from which a pathologist can derive a whole-slide-level diagnosis.

In our study, we used a dataset containing H&E stained sentinel lymph node (SLN) sections of breast cancer patients (Litjens et al., 2018). The diagnosis of such specimens contains four main categories (Apple, 2016), which we also sketch in code after the list:

  • Isolated tumor cells (ITC) if the node contains a single tumor cell or cell deposits that are no larger than 0.2 mm or contain fewer than 200 cells;

  • Micro if containing metastasis greater than 0.2 mm and/or more than 200 cells;

  • Macro if containing metastasis greater than 2 mm;

  • Negative if containing no tumor cells.
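For illustration, a hedged sketch of this categorization as a rule function follows. The thresholds come from the list above; the function signature is ours, and borderline cases that satisfy both the ITC and Micro descriptions are resolved in favor of Micro.

```python
def sln_category(largest_focus_mm: float, tumor_cell_count: int) -> str:
    """Map a measured metastasis size and cell count to a slide-level category."""
    if largest_focus_mm == 0 and tumor_cell_count == 0:
        return "Negative"   # no tumor cells
    if largest_focus_mm > 2.0:
        return "Macro"      # metastasis greater than 2 mm
    if largest_focus_mm > 0.2 or tumor_cell_count > 200:
        return "Micro"      # greater than 0.2 mm and/or more than 200 cells
    return "ITC"            # isolated tumor cells / deposits <= 0.2 mm, fewer than 200 cells
```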

3.2. Promises & Challenges of AI for Digital Pathology

Digital pathology transforms traditional microscopic analysis of histological slides into high-resolution digital visualization (Holzinger et al., 2017). It allows pathologists to investigate fine-grained pathological information, transfer previously-learned knowledge to new tasks (Holzinger et al., 2017), and, most importantly, leverage the recent development of data-driven AI to augment their visual analytical tasks.

However, the main challenge for digital pathology is that, unlike other imaging modalities (e.g., X-Ray, CT), histological data (e.g., ovarian carcinoma) tends to have a high between-patient variance (Köbel et al., 2010); thus a pre-trained model often struggles to generalize when deployed to a new set of patients. Such uncertainty in performance creates a barrier that prevents AI from being adopted to assist diagnosis in digital pathology.

To overcome this barrier, one solution is to improve the machine learning model by training it on a sufficiently large amount of patient data using cost-effective labeling and learning schemes (Nalisnik et al., 2017; Campanella et al., 2019). However, such a ‘big data’ approach attempts to close the gap by (marginally) improving AI’s performance, while ignoring the opportunity to engage human physicians. As a result, efforts are often bound to repeat the ‘Greek Oracle’ pitfall pointed out by Miller and Masarie almost three decades ago (Miller and Masarie, 1990). The focus of our paper is to explore and study the oft-missed opportunity of combining physicians with an ‘imperfect’ AI: rather than awaiting AI to be fully automatable one day, how can we make use of its capability with limitations today?

Below we describe a scenario walkthrough of Impetus—a tool that explores how AI, without yet being able to fully automate diagnosis, can still empower pathologists by becoming an integral part of their workflow.

3.3. Scenario Walkthrough

The user of Impetus, a pathologist, starts the diagnosis of a patient's case by importing multiple Whole Slide Images (WSIs) of the patient into Impetus.

Figure (walkthrough). Key interactive features of Impetus: (a) as a pathologist loads a whole-slide image, AI highlights areas of interest identified by outlier detection, shown as two yellow recommendation boxes. (b) Agile labeling: the pathologist can drag and click to provide a label that can be used to train the AI's model. (c) Diagnosis dialogue, pre-filled with AI's diagnosis, allows the pathologist to either confirm or disregard it and proceed with manual diagnosis.

First, the pathologist's attention is drawn to the two boxes generated by the AI, which encompass regions of patches that visually appear to be 'outliers' from the majority of cells (walkthrough(a)) and are thus likely to be tumor-positive. With these automatic recommendations, Impetus alleviates the pathologist's burden of navigating a large, high-resolution image and going through a large number of areas that might or might not be as tumor-characteristic as the recommended regions.

Next, the pathologist performs diagnosis by marking each recommended region as either 'tumor' or 'normal', and continues to marquee-select and label a few more regions on the WSI (walkthrough(b)). As the pathologist makes such diagnoses, their input is also collected by the back-end AI and used as labels to adapt the model to better align itself with the pathologist's domain knowledge. In contrast to conventional data labeling tasks, Impetus' agile labeling is designed to be lightweight and able to learn from pathologists' input of coarsely marked regions without having to trace a precise contour of a tumor region. In this way, Impetus allows pathologists to agilely train an AI model as a natural and integral part of their existing workflow without incurring extra effort.

As the pathologist diagnoses more WSIs (which also trains the AI), they notice that some new slides are already marked as 'diagnosed'—the AI takes the initiative to diagnose slides that it feels highly confident about. Thus the pathologist skips ahead to see other unlabeled slides, some of which have pre-filled diagnosis dialogues (walkthrough(c)). In such cases, the pathologist examines the WSI to verify the AI's hypothesis. In the rest of the WSIs, the AI almost becomes invisible (due to a lack of confidence) and the pathologist proceeds to manually finish the diagnostic tasks.

The above scenario demonstrates how an 'imperfect' AI can still benefit a pathologist without necessarily automating the user's existing workflow: recommendation boxes suggestively prioritize pathologists' manual searching process (walkthrough(a)), agile labeling adapts the AI while minimizing extra effort from the pathologists (walkthrough(b)), and as the AI attempts to improve itself, it handles cases with different degrees of initiative—from full automation, to pre-filling plausible results, to remaining completely 'invisible'—based on its confidence (walkthrough(c)).

4. Design and Implementation

Below we first describe the design process and then detail the specific interaction techniques and their implementation in Impetus.

4.1. Overview of the Design Process: Empirical & Theoretical Grounding

The design of Impetus is grounded in both empirical evidence and principles drawn from literature.

On the empirical side, we co-designed Impetus with our pathologist collaborator. Specifically, we learned that one major challenge for pathologists is efficiently and effectively navigating large, high-resolution WSIs. This suggests that AI, besides making diagnoses, can usefully serve to guide pathologists in navigating the complex, high-resolution image space. We detail this design in §4.2.

On the theoretical side, Impetus goes beyond the singular objective of automation by offering a spectrum of AI-enabled assistance. As pointed out in Blois' seminal paper, a physician's differential diagnosis process is similar to a funnel: it starts with a broad exploration of plausible conditions and gradually rules out less likely possibilities as more evidence (e.g., test results) is gathered, until finally a single most probable conclusion can be drawn. According to Blois, AI has been canonically developed to optimize Point B, where a computer program can deterministically confirm whether a patient has a certain disease given all the evidence. As Blois foresaw, recent developments in AI start to exhibit capabilities towards Point A: for example, Stanford's CheXpert produces likelihoods of 10+ thoracic diseases based on a chest X-ray image (Irvin et al., 2019a). Similarly, Impetus also aims at "reaching Point A" by enabling pathologists' initial exploration with recommended regions.

Figure (funnel). A physician's differential diagnosis process is similar to a funnel: it starts with a broad exploration of plausible conditions and gradually rules out less likely possibilities as more evidence (test results) is gathered, until finally a single most probable conclusion can be drawn. Impetus supports exploration near Point A by enabling pathologists' initial exploration with recommended regions. Image credit: Blois (Blois, 1980).

Overall, Impetus provides the first suite of interaction techniques in the medical imaging domain that instantiate mixed-initiative principles (Horvitz, 1999) for physicians to interact with AI with adaptive degrees of initiative based on the AI's capabilities and limitations. Specifically, we focus on the following principles from (Horvitz, 1999):


  • Scoping precision of service to match uncertainty. We first design a rule-based algorithm to identify three levels of uncertainty in AI’s performance given a WSI (detailed in §4.4), based on which we then design the corresponding AI-initiated action appropriate for each level of uncertainty (Table 1).

  • Providing mechanisms for efficient agent-user collaboration to refine results. For each AI-initiated action, we also design mechanisms to introduce physician-initiated actions aimed at confirming, refining, or even overriding AI's results (Table 1). Further, we extend this principle by leveraging physician-initiated input for 'machine teaching' (Simard et al., 2017), an agile labeling technique to dynamically adapt an AI by retraining it with examples of how a physician interprets a patient's histological data (detailed in §4.3).

4.2. AI Guiding Pathologists’ Attention to Regions of Major Outliers

In our communication with our pathologist collaborator, we learned that one major challenge for pathologists is efficiently and effectively navigating a large, high-resolution WSI. To address this, we design the AI to guide pathologists' attention to regions of major outliers—regions that appear visually different from the rest of the WSI and are more likely to be tumors. Such guidance is manifested in two user interface elements:

Attention map visualizes each patch's degree of outlyingness, overlaid on the current WSI (Figure 1(a)); Recommendation boxes serve as a more explicit means to draw pathologists' attention to large clusters based on the outlier detection results (walkthrough(a))—these boxes are always visible, whether on the original WSI, on the attention map, or on the prediction map (described below).

Implementation. When the system is first loaded, a model trained on the PatchCamelyon dataset (https://patchcamelyon.grand-challenge.org/) extracts patch features in WSIs. In the first iteration, the system performs outlier detection based on the extracted features, and the detected outliers are highlighted in the attention map. In the following iterations, the attention map is a combination of outliers (from the initial detection) and high-uncertainty patches (from the specific model in each iteration). To obtain the recommendation boxes, the system clusters WSI patches by attention value. In each iteration, the recommendation boxes are selected as the two clusters that occupy the largest areas on the WSI.
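The following is a minimal sketch of this pipeline under stated assumptions: per-patch features and (row, column) grid coordinates are precomputed, IsolationForest stands in for the unspecified outlier detector, and K-Means clusters the most outlying patches spatially; the paper itself does not name these estimators.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

def recommendation_boxes(features, coords, n_clusters=5, n_boxes=2):
    # 1) An outlier score per patch serves as the "attention" value (higher = more outlying).
    iso = IsolationForest(random_state=0).fit(features)
    attention = -iso.score_samples(features)

    # 2) Spatially cluster the most outlying patches.
    mask = attention > np.percentile(attention, 90)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(coords[mask])

    # 3) Keep the clusters occupying the largest areas and return their bounding boxes.
    sizes = np.bincount(labels)
    boxes = []
    for c in np.argsort(sizes)[::-1][:n_boxes]:
        pts = coords[mask][labels == c]
        boxes.append((pts.min(axis=0), pts.max(axis=0)))  # (top-left, bottom-right) in grid coords
    return attention, boxes
```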

Figure 1. The two maps used by Impetus to provide guidance and communicate AI results. (a) Attention map, where outlier patches and high-uncertainty patches are highlighted in red, while other patches are in blue. Yellow recommendation boxes are generated by clustering attention values. (b) Prediction map, where red shows a high probability of tumor and white shows a low probability of tumor, as predicted by the AI. The green and red boxes are areas of "normal" and "tumor", as labeled by the pathologist. Recommendation boxes generated by clustering attention values are also visible on this map.

4.3. AI Using Agile Labeling to Train and Adapt Itself On-the-fly

In digital pathology, the main challenge for AI is that, unlike other imaging modalities (e.g., X-ray, CT), histological data (e.g., ovarian carcinoma) tends to have a high variance across slides of different patients (and sometimes of the same patient) (Köbel et al., 2010). Thus a pre-trained model often struggles to generalize to new data. To address this limitation, Impetus lets pathologists use agile labeling to train the AI on the fly. Agile labeling allows a pathologist to directly label a recommendation box (walkthrough(a)), to draw a bounding box of tumor-negative patches (walkthrough(b)), or to draw a box containing a mix of negative and positive cells; these serve as labels to further train the existing model to incorporate pathologists' domain knowledge. Importantly, such a labeling technique is designed to be agilely achievable without incurring significant extra effort that interrupts the main diagnosis workflow.

Implementation. Agile labeling does not require users to provide the exact contour of tumor tissue in a WSI, as strongly-supervised learning does. Instead, a user can marquee-select a positive box over an area that contains at least one tumor patch, or a negative box over an all-negative region. We implemented weakly-supervised multiple instance learning (MIL) (Zhou, 2004) to enable a traditional random forest algorithm to learn over such agile labels. To train the model, the system first initializes a positive instance set S_p and a negative instance set S_n. In the MIL setup, each bag is the feature set of the patches within a box and has only one box-level label. For a negative box, all the instances in the box can be included in S_n. For positive boxes, the system uses t-SNE (Maaten and Hinton, 2008) to represent the high-dimensional features with a two-dimensional embedding. Then, K-Means clustering is used to split the instances into two clusters, C_1 and C_2. After clustering, the algorithm compares the two clusters with negative samples from the negative boxes to pick the real positive cluster. Once the positive cluster is recognized, all its instances are included in S_p. Finally, a random forest classifier (MIL-RF) is trained with the obtained S_p and S_n, and the user can continuously provide more annotations until the trained classifier reaches a satisfactory level of performance.
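A hedged sketch of this training procedure follows. The variable names (S_p, S_n) and the heuristic for picking the positive cluster (the one whose mean feature vector lies farther from the pooled negatives) are our reconstruction of the description above, not the authors' exact code.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def train_mil_rf(neg_boxes, pos_boxes):
    """neg_boxes / pos_boxes: lists of (n_patches, n_features) arrays, one per labeled box."""
    S_n = np.vstack(neg_boxes)          # every patch in a negative box is a negative instance
    S_p = []
    for bag in pos_boxes:
        # Embed the bag's features in 2D with t-SNE, then split them into two clusters.
        emb = TSNE(n_components=2, perplexity=min(30, len(bag) - 1),
                   random_state=0).fit_transform(bag)
        cl = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
        # Treat the cluster farther from the negatives as the true positive instances.
        dists = [np.linalg.norm(bag[cl == k].mean(axis=0) - S_n.mean(axis=0)) for k in (0, 1)]
        S_p.append(bag[cl == int(np.argmax(dists))])
    S_p = np.vstack(S_p)

    X = np.vstack([S_p, S_n])
    y = np.hstack([np.ones(len(S_p)), np.zeros(len(S_n))])
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)  # MIL-RF
```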

4.4. AI Taking Initiatives Appropriately for the Level of Performance Confidence

Even with agile labeling, lightweight on-the-fly learning yields only limited improvement compared to extensive offline training. Thus it is crucial to convey the level of the AI's performance to the pathologists. In Impetus, the AI takes initiatives appropriate to its performance confidence level, as manifested in the following two user interface elements: Prediction map visualizes the current AI's results overlaid on the WSI, which serves to inform both the labeling and the usage of the current AI model (Figure 1(b)). Initiatives based on confidence—the more uncertain the AI 'feels' about a WSI, the less initiative it takes, as shown in Table 1.

AI Confidence | AI-Initiated Action | Physician-Initiated Action
High | Performing diagnosis automatically in the background; marking WSIs as diagnosed | Doing nothing and accepting AI's results; can re-open a WSI to overwrite AI's result
Mid | Pre-filling the diagnosis box without directly labeling the WSI | Performing diagnosis with help from AI predictions; confirming or correcting the pre-filled dialogue
Low | Showing the original WSI by default to prompt for manual diagnosis | Performing diagnosis with little input from AI
Table 1. Spectrum of human and AI initiatives at different AI confidence levels.

Implementation. Impetus has a rule-based confidence-level classifier to sort slides into three categories: high-confidence, mid-confidence, and low-confidence. First, predictions of all the patches in the WSI are obtained. A patch has two characteristics: is_positive and is_uncertain. A patch is positive if the MIL-RF classifier output is greater than 0.5, and is uncertain if the MIL-RF classifier output lies close to the 0.5 decision boundary. We empirically summarize the confidence-level decision rules (which can be easily modified via a configuration file of our tool) as follows, and sketch them in code after the list:


  • If there are more than 200 positive patches AND the number of positive patches is greater than twice the number of uncertain patches, then the slide is predicted as high-confidence;

  • If there are no outlier clusters, then the slide is predicted as low-confidence;

  • If the number of uncertain patches is greater than 300, then the slide is predicted as low-confidence;

  • If the number of positive patches is greater than 200, then the slide is predicted as high-confidence;

  • For all other cases, the slide is predicted as mid-confidence.
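The sketch below is a direct transcription of these rules, applied in the order listed (first match wins); the threshold values come from the text, while the function and argument names are ours.

```python
def slide_confidence(n_positive: int, n_uncertain: int, n_outlier_clusters: int) -> str:
    """Rule-based confidence level for one WSI, mirroring the rules above."""
    if n_positive > 200 and n_positive > 2 * n_uncertain:
        return "high"
    if n_outlier_clusters == 0:
        return "low"
    if n_uncertain > 300:
        return "low"
    if n_positive > 200:
        return "high"
    return "mid"
```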

5. Work Sessions with Pathologists

To validate our design of Impetus, we observed how pathologists used the tool to perform diagnosis on a clinical dataset (Litjens et al., 2018). Our goal was to study whether the AI in Impetus could be compatibly integrated into pathologists' workflow and provide added value to pathologists' diagnosis process.

Participants. We recruited eight medical professionals from the pathology department of a local medical center. The participants had experience ranging from 1 to 43 years and included residents, fellows, and attending pathologists.

Data & apparatus. We used the Camelyon 17 dataset (Litjens et al., 2018) and selected 16 WSIs that were collected in the same medical center (our pilot studies indicated that 16 WSIs would allow us to finish a session in about an hour, making the most effective use of the pathologists' time). Participants interacted with Impetus on a 15-inch laptop computer using a wired mouse. Impetus ran on Microsoft Windows 10 with an Nvidia 960M GPU and 16 GB of RAM.

Design. Our discussions with pathologist collaborators and an initial screening survey indicated that there was currently no single commonly used tool for digital pathology. To help pathologists calibrate their experience with Impetus, we introduced another tool—ASAP (https://computationalpathologygroup.github.io/ASAP/)—which represents a very basic manual tool for viewing and annotating digital pathology slides. Each pathologist interacted with both Impetus and ASAP, which were referred to as System A and System B, respectively, to avoid biasing the pathologists. The order of tools was counterbalanced across the eight pathologists. Twelve of the 16 slides were diagnosed using Impetus and the remaining four using ASAP: we assigned more slides to Impetus as it was the target of our study, whereas ASAP was used only to calibrate pathologists' tool experience.

Figure 2. We conducted work sessions with eight pathologists from a local medical center to observe how they used Impetus as part of their diagnosis process.

Tasks & Procedure. After briefly introducing the background of computer-assisted diagnosis, we walked each pathologist through a tool and let them practice on a separate toy dataset also gathered from (Litjens et al., 2018). We then asked questions about how the participant understood different interactive components, whether the tool was easy to learn and use, and whether the tool was helpful to their diagnosis. Then, the primary task began, which was to diagnose the entire group of WSIs using the provided tool in each condition. A trial started with a participant clicking to open a WSI and finished when they selected a diagnosis and clicked the 'Confirm' button. After each condition, we further conducted a brief semi-structured interview for each participant to summarize their experience, feedback, and suggestions for the tool. Participants took a short break between the two conditions.

Analysis. We employed an iterative open-coding method to analyze the qualitative data collected from the semi-structured interviews with pathologists. Two experimenters coded each participant's data within one day after the study. One experimenter performed the first pass of coding and updated a shared codebook, which was then reviewed by the other experimenter to resolve disagreements. The two experimenters alternated the roles of first coder and reviewer. After all the participants' data were coded and consolidated, a third experimenter reviewed all the codes and transcripts and resolved disagreements through discussion with the previous two experimenters. Finally, we arrived at six high-level themes, which we summarize below as lessons learned.

6. Findings & Lessons Learned

Based on the observations and data from the work sessions with pathologists, we present our findings below, summarized into six lessons. To maintain consistency, we organize these lessons using the same structure as §4.

6.1. AI Guiding Pathologists’ Attention to Regions of Major Outliers

6.1.1. Recommendation boxes

(walkthrough(a)) were the most frequently used and discussed feature during the study. We observed that in almost all trials, pathologists started by zooming into the recommendation boxes and tried to provide annotations for the outlined regions. Pathologists found it helpful to have such concrete starting points for their examination.

"… [recommendation boxes] narrow down the area of interest … it helps" (P7)

"It was less effort because I was focusing only on the attention areas and not focusing on the other areas of the node so it was different from my usual way of looking at a slide." (P2)

However, pathologists did not always find that the recommended regions matched their intuition, and they could not understand why certain regions were recommended.

"… [the recommendation box] seems a little bit random. It's not necessarily areas that I would [look at] …" (P5)

"The things it's focusing on does not correlate with at least what my brain thinks I am looking for." (P6)

A lack of transparency is not a new problem in recommender systems research (Zhang et al., 2014). When introducing Impetus, we did explain that recommendation boxes were based on a detection of visual outliers, and all pathologists acknowledged that they understood the concept. Although such outliers were computed based on histological features (from the PatchCamelyon dataset), they did not always agree with what pathologists intuited as 'interesting' regions worth examination. When such a mismatch occurred—an unexpected recommendation—pathologists could no longer reason about the recommendation boxes simply by referring to the abstract concept of 'visual outliers'. At times, pathologists started to develop their own hypotheses of how the AI was processing the WSI: "… it's interesting that it's picking area with fat as area of interest." (P2)

Lesson #1: To explain AI's guidance, suggestions, and recommendations, the system should go beyond a one-size-fits-all concept and provide instance-specific details that allow a medical user to see the evidence that leads to a recommendation.

We also found that pathologists wondered what they should do about the area outside of the recommendation boxes:

"So I just look at the ones in the [recommendation] square?" (P7)

"Am I supposed to assume the rest of it is normal? I don't have to go searching for the rest of the slides for [tumor]?" (P2)

Pathologists understood the implication of the recommendation boxes: to prioritize certain regions of a WSI and to serve as a 'shortcut' in lieu of scanning the entire WSI. However, it was unclear what the implication was for the areas outside of the recommendation boxes. This was especially true when pathologists could not find signs of tumor in the recommended regions: the system did not continue to guide them on how to proceed with the rest of the WSI.

Lesson #2: Medical diagnosis is seldom a one-shot task; thus AI's recommendations need to continuously direct a medical user to filter and prioritize a large task space, taking into account new information extracted from the user's up-to-date input.

6.1.2. Attention map

(Figure 1a) visualizes outliers detected by the AI—the same information on which the recommendation boxes are based. It was designed to complement the recommendation boxes with a backdrop of detailed guidance. We expected pathologists to use the attention map similarly to the recommendation boxes, to direct their attention to more outlying regions for examination. However, pathologists did not find the attention map useful:

"The attention map shows the same thing as the recommended box. The box is enough to direct my attention." (P2)

"I don't really see the point of the attention map … These two maps are redundant." (P4)

The main difference was that recommendation boxes cost less effort to process, while the attention map needed to be navigated (panned and zoomed) and interpreted (mentally 'decoding' the color scheme). Further, recommendation boxes provided actionable information (look into this box first), while the attention map was action-neutral. Given that the pathologists' overall goal was to reduce the amount of area to study, they tended to prefer less extra effort and information with clearer actionability.

Lesson #3: Medical tasks are often time-critical; thus the benefits of AI's guidance, suggestions, and recommendations need to be weighed against the amount of extra effort incurred and the actionability of the provided information.

6.2. AI Using Agile Labeling to Train and Adapt Itself On-the-Fly

Prediction map (Figure 1b) visualizes current AI’s diagnosis of the WSI and was designed to help the pathologists assess the model’s performance and decide where they could provide more labels.

However, pathologists used the prediction map differently than we expected. They would often zoom into a recommendation box on the WSI, study the region for a few seconds, then switch to the prediction map for a few seconds, and switch back to the WSI. They tended to use the prediction map as a tool to help them see if there was something 'interesting' in the currently zoomed-in region. Sometimes pathologists used the prediction map to double-check their developing diagnosis:

"That was all negative, and I didn't get a strong heatmap signal, so it was confirmatory and somewhat helpful." (P6)

Interestingly, how pathologists used the prediction map seemed to complement the recommendation boxes: while the recommendation boxes told pathologists which regions were worth looking at (i.e., might contain tumors), the prediction map confirmed pathologists' assumption when they thought a region was of little 'interest' (no signs of tumor).

Lesson #4: To guide the examination process with prioritization, AI should help a medical user narrow in on small regions of a large task space, as well as help them filter out information within specific regions.

The unexpected usage of the prediction map affected agile labeling, as we discuss below.

Agile labeling (walkthrough(b)) allows a pathologist to directly label a recommendation box, or to marquee-select a region and coarsely annotate it as normal or tumor. In the introduction phase, all pathologists reported having no problem understanding the idea of continuously labeling WSIs to improve the AI:

"This is actually adding more work for me, but I would be willing to add labels knowing I would be improving the model." (P4)

However, during the tasks, we noticed that almost all labels were drawn based only on the recommendation boxes. Only one pathologist actively searched other regions to provide more labels. It seemed that the recommendation boxes served as a prompt, and pathologists were unmotivated to label other regions unprompted.

We believe one fundamental reason is a lack of feedback to inform pathologists how important their labels were to the model retraining. Without such feedback, it might have been unclear to pathologists whether they needed to provide labels at all, or how much labeling would be enough.

"Do I need to add labels?" (P6)

"Should I have provided more labels?" (P5)

We assumed that once pathologists saw that a prediction map contained inaccurate results, they would be motivated to provide more labels to improve the prediction. However, our observations show that pathologists were more likely to make a diagnosis directly by manual examination instead of correcting the AI's predictions as we expected. Falling back to manual examination seemed a more cost-effective alternative than trying to iteratively improve the AI.

Lesson #5: It is possible for medical users to provide labels during their workflow with acceptable extra effort. However, the system should provide explicit feedback on how the model improves as a result, as a way to motivate and guide medical users' future inputs.

6.3. AI Taking Initiatives Appropriately for the Level of Performance Confidence

As shown in Table 1, the AI's level of initiative is mediated by its level of confidence in the model's performance. For low-confidence cases, the AI took no initiative, and pathologists were mostly unaware of the AI's presence; they simply focused on performing their usual manual diagnosis. For high-confidence cases, as expected, pathologists quickly confirmed the AI's proactive diagnosis of macro—the easiest type of tumor to detect for both pathologists and AI. However, when it came to cases diagnosed as negative by the AI, pathologists tended to perform a manual diagnosis anyway:

"On the ones that it said it's confident but didn't really tell you it's negative, I still felt like I had to look at those to confirm. I wasn't going to trust the system [to confirm] that it's negative." (P2)

In pathology, in order to rule out tumors, pathologists have to thoroughly examine the entire WSI, whereas it takes only one positive finding to diagnose the lymph node as positive. Thus there was a discrepancy of trust between macro and negative, even though the AI treats both simply as different labels of a slide image and categorizes both as high-confidence.

Lesson #6: Tasks treated equally by an AI might carry different weights for a medical user. Thus, for medically high-stakes tasks, AI should provide information to validate its confidence level.

For mid-confidence cases, the AI was designed to pre-fill the diagnosis dialog (but without any confirmative action) as a way to hint at its prediction without signaling any conclusive decision. This design did not seem to have noticeable effects on the pathologists, which echoes Lesson #3: information needs to present actionability in order to affect a medical user's workflow.

7. Design Recommendation for Human-Centered Medical Imaging AI

Extending the six lessons learned, we further discuss design recommendations for future development in AI-enabled medical imaging. We describe each recommendation in the specific context of Impetus as a way to ground the reader's understanding, although we believe these recommendations can be extended to other AI-enabled medical imaging techniques (e.g., X-ray, CT, MRI), which we leave as future work. Further, some of our recommendations suggest new technical challenges for the machine learning and AI communities.

Relationship to prior work. There has been a school of prior work on providing guidelines for designing general human-AI interaction. Horvitz's principles can be thought of as a list of high-level requirements for a system to incorporate mixed-initiative interaction (Horvitz, 1999). Amershi et al.'s list of guidelines informs how to design an AI's behavior throughout interaction (Amershi et al., 2019). Beyond these general guidelines, Yang et al. contribute insights on designing AI for the medical domain based on field studies (Yang et al., 2016, 2019). Wang et al. follow a top-down approach to develop a conceptual framework that links people's reasoning processes to explainable AI and apply such a framework in designing a tool for phenotyping (Wang et al., 2019). In comparison, our recommendations are bottom-up, stemming from an end-to-end exercise of designing, implementing, and testing an actual system with medical professionals. As shown below, all the recommendations are developed from the lessons mentioned above and can be translated into the next design iteration of Impetus, as well as extended to other medical AI systems.

Lesson #1: To explain AI's guidance, suggestions, and recommendations, the system should go beyond a one-size-fits-all concept and provide instance-specific details that allow a medical user to see the evidence that leads to a recommendation.

Recommendation #1: an overview + instance-based explanation of AI's suggestions. Currently, Impetus only explains the suggested regions at the overview level: a textual description of the outlier detection method as part of the tutorial and a visualization (the attention map) that shows the degree of 'outlying' across the WSI. In addition, we can incorporate instance-based explanation, with information specific to a particular patient and a particular region of the patient's slide. The idea is to allow pathologists to question why a specific region is recommended by clicking on the corresponding part of the slide, which prompts the system to show a comparison between the recommended region and a number of samples from non-recommended parts of the slide, allowing the physician to contrast the features AI extracted in these regions. One important consideration is that such an additional explanation should be made available on demand rather than shown by default, which could defeat the recommendation boxes' purpose of accelerating the pathologists' examination process.
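A minimal sketch of such an on-demand, instance-based explanation follows, assuming precomputed AI features and grid coordinates for all patches; the sampling strategy and the returned contrast set are illustrative choices, not part of Impetus.

```python
import numpy as np

def explain_region(region_feat, other_feats, other_coords, k=5, seed=0):
    """Return k non-recommended patches (coordinates + feature distance) to
    contrast against the clicked recommended region."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(other_feats), size=k, replace=False)
    dists = np.linalg.norm(other_feats[idx] - region_feat, axis=1)
    return list(zip(other_coords[idx].tolist(), dists.tolist()))
```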

Lesson #2: Medical diagnosis is seldom a one-shot task; thus AI's recommendations need to continuously direct a medical user to filter and prioritize a large task space, taking into account new information extracted from the user's up-to-date input.

Recommendation #2: make AI-generated suggestions always available (and constantly evolving) throughout the process of a (manual) examination. For example, in Impetus, a straightforward design idea is to show recommendation boxes one after another. We believe this is especially helpful when the pathologist is drawn to a local, zoomed-in region and neglects the rest of the WSI. The always-available recommendation boxes can serve as global anchors that inform pathologists of what might need to be examined elsewhere, beyond the current view.

Lesson #3: Medical tasks are often time-critical; thus the benefits of AI's guidance, suggestions, and recommendations need to be weighed against the amount of extra effort incurred and the actionability of the provided information.

Recommendation #3: weigh the amount of extra effort by co-designing the system with target medical users, as different physicians have different notions of time urgency. Emergency room doctors often deal with urgent cases by making decisions in a matter of seconds; internists often perform examinations in 15-20 minutes per patient; oncologists or implant specialists might decide on a case over multiple meetings that span days. There is a sense of timeliness in all these scenarios, but the amount of time that can be budgeted differs from case to case. To address such differences, we further recommend modeling each interactive task in a medical AI system (e.g., how long it might take for the user to perform each task) and providing a mechanism that allows physicians to 'filter out' interactive components that might take too much time (e.g., the attention map in Impetus). Importantly, different levels of urgency should be modifiable (perhaps as a one-time setup) by physicians in different specialties.
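One possible way to operationalize this recommendation is sketched below: each interactive component declares an estimated interaction cost, and components exceeding a specialty-specific time budget are hidden. The data structure and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UIComponent:
    name: str
    est_seconds: float  # estimated time to consume and act on this component

def visible_components(components, budget_seconds):
    """Hide components whose estimated cost exceeds the physician's time budget."""
    return [c for c in components if c.est_seconds <= budget_seconds]

# Example: an emergency setting might budget ~10 seconds, hiding the attention
# map (costly to interpret) while keeping the recommendation boxes.
components = [UIComponent("recommendation_boxes", 5), UIComponent("attention_map", 45)]
print([c.name for c in visible_components(components, budget_seconds=10)])
```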

Lesson #4: To guide the examination process with prioritization, AI should help a medical user narrow in on small regions of a large task space, as well as help them filter out information within specific regions.

Recommendation #4: use visualization to filter out information, leveraging AI's results to reduce the information load for physicians. An example would be a spotlight effect that darkens parts of a WSI where the AI detects little or no tumor cells. Based on our observation that pathologists used AI's results to confirm their examination of the original H&E WSI, such an overt visualization can help them filter out subsets of the WSI patches. Meanwhile, pathologists can also reveal a darkened region if they want to examine the AI's findings further (e.g., when they disagree with the AI and believe a darkened spot shows signs of tumor).
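A minimal sketch of the proposed spotlight effect follows, assuming a prediction map upsampled to the displayed image's resolution; the threshold and dimming factor are illustrative values, not ones from the paper.

```python
import numpy as np

def spotlight(wsi_rgb, prediction_map, threshold=0.2, dim=0.35):
    """Darken regions where the AI predicts little or no tumor probability."""
    mask = (prediction_map >= threshold)[..., None]          # keep likely-tumor areas lit
    return np.where(mask, wsi_rgb, (wsi_rgb * dim).astype(wsi_rgb.dtype))
```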

Lesson #5: It is possible for medical users to provide labels during their workflow with acceptable extra effort. However, the system should provide explicit feedback on how the model improves as a result, as a way to motivate and guide medical users' future inputs.

Recommendation #5: when adapting the model on the fly, show a visualization that indicates how the model's performance changes as the physician labels more data. There could be various designs for such information, from low-level technical details (e.g., the model's specificity vs. sensitivity), to high-level visualizations (e.g., charts that plot accuracy over WSIs read), to actionable items (e.g., 'nudging' the user to label certain classes of data to balance the training set). There are two main factors to consider when evaluating a given design: (i) as we observed in our study, whether the design can inform the physician of the model's performance improvement or degradation as they label more data, which can be measured quantitatively as the amount of performance gain divided by the amount of labeling work done; and (ii) as we noted in Lesson #3, whether consuming the extra information incurs too much effort, slows down the agile labeling process, and whether the extra information about model performance changes is actionable.
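For the quantitative measure mentioned in factor (i), a hedged sketch follows: performance gain per unit of labeling work across agile-labeling iterations, using balanced accuracy on a held-out validation set as an assumed metric.

```python
from sklearn.metrics import balanced_accuracy_score

def gain_per_label(model_before, model_after, X_val, y_val, n_new_labels):
    """Performance gain divided by the amount of labeling work done in one iteration."""
    before = balanced_accuracy_score(y_val, model_before.predict(X_val))
    after = balanced_accuracy_score(y_val, model_after.predict(X_val))
    return (after - before) / max(n_new_labels, 1)
```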

Lesson #6: Tasks treated equally by an AI might carry different weights for a medical user. Thus, for medically high-stakes tasks, AI should provide information to validate its confidence level.

Recommendation #6: provide additional justification for a negative diagnosis of a high-stakes disease. For example, when Impetus concludes a case as negative, the system can still display the top five regions where the AI finds the most likely signs of tumor (albeit below the threshold of positivity). In this way, even if the result turns out to be a false negative, the physician would be guided to examine the regions where actual tumor cells are most likely to appear. Beyond such intrinsic details, it is also possible to retrieve extrinsic information, e.g., the prevalence of the disease in the patient's population, or similar histological images for comparison. As suggested in (Xie et al., 2020), such extrinsic justification can complement the explanation of a model's intrinsic process, thus allowing physicians to understand AI's decisions more comprehensively.
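A minimal sketch of surfacing the top five most suspicious regions for a negative call follows; it simply ranks patches by predicted tumor probability, and k=5 mirrors the example above.

```python
import numpy as np

def top_suspicious_patches(prediction_map, k=5):
    """Return grid coordinates of the k patches with the highest tumor probability."""
    flat = prediction_map.ravel()
    idx = np.argsort(flat)[::-1][:k]
    return [np.unravel_index(i, prediction_map.shape) for i in idx]
```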

8. Discussion and Conclusion

This paper explores how AI’s capabilities with limitations can still benefit a traditional manual diagnosis process on histological data. We investigate this question through the design and study of Impetus, a tool where an AI takes multi-leveled initiatives to provide various forms of assistance to a pathologist performing tumor detection in whole-slide histological images. We conducted work sessions with eight medical professionals using Impetus and summarize our findings and lessons learned, based on which we provide design recommendations with concrete examples to inform future work on human-centered medical AI systems.

Below, we discuss several limitations encountered during the development of our system.

Detecting small lesion tissues in a WSI. During the work sessions, we found it hard for the system to detect small lesion tissues in a WSI. However, from our interviews with medical professionals, it is more valuable to find those areas with AI, whereas large, macro-metastatic tissues can be located quickly without assistance. Here we summarize the reasons why the machine learning algorithm fails to localize small lesion tissues. First, the system takes patches as input to extract features and classifies them with the trained MIL-RF. This approach can be problematic when detecting small-area lesion tissues, since the classification performance highly depends on color, and small metastases do not change the color of patches significantly. Further, the machine learning model treats tissues in a WSI as separate patches without considering structural correlations among the tissues. Specifically, lymph node invasion starts from the perimeter of the node, so small lesion tissues are more likely to appear in those peripheral regions.

Integration into pathologists' workflow. Data encountered in a pathologist's day-to-day workflow are imbalanced by nature: metastasis areas often occupy a small fraction of the entire WSI. As a result, there are more negative annotations than positive ones. Furthermore, with the coarse labeling enabled by our MIL algorithm, only a subset of the patches in a positive annotation are truly positive. This imbalance in training data skews the model's predictions.

On the other hand, to use an AI system for diagnosis, the AI's performance must be validated. However, a model trained on-the-fly cannot be practically validated: it is not feasible to fully evaluate the model's performance after every iteration of labeling and training, even when the amount of new training data is small. Such a dynamic system is hard to control so that it maintains a certain level of performance regardless of run-time user interaction.

Combining prior knowledge. In a pathologist's real diagnosis environment, extra patient information is necessary for diagnosing a glass slide; this often includes the patient's medical history and the type of cancer determined from previous examinations. From our interviews with pathologists, we learned that such information is crucial for diagnosis speed and accuracy, as it informs the pathologist of what to look for and where to find it. To better match the AI with a pathologist's mental model and to provide better guidance and explanations, we should incorporate patient information into the diagnosis model. For example, the AI might use a different CNN to look for a specific type of tumor tissue given a particular cancer type.

Explicit vs. implicit feedback. So far, Impetus primarily relies on explicit feedback from physicians via agile labeling. Future work should leverage other information as implicit feedback, such as which WSI areas a pathologist looks at first or spends the most time examining, and where and how much they zoom in. Inferring useful information for adapting the model presents new technical challenges for the machine learning community; for HCI and CSCW researchers, the challenge is to make such inference transparent by informing pathologists how the AI is learning from their implicit behavior.
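A minimal sketch of extracting such implicit signals from viewer logs is shown below; the log format (timestamped viewport records) and all field names are our assumptions.

from collections import defaultdict

def dwell_time_per_region(viewport_log, grid_size=512):
    """Aggregate how long the pathologist kept each WSI region in view.
    viewport_log: list of dicts like
        {"t": seconds, "x": center_x, "y": center_y, "zoom": magnification}."""
    dwell = defaultdict(float)
    max_zoom = defaultdict(float)
    for prev, curr in zip(viewport_log, viewport_log[1:]):
        region = (prev["x"] // grid_size, prev["y"] // grid_size)
        dwell[region] += curr["t"] - prev["t"]
        max_zoom[region] = max(max_zoom[region], prev["zoom"])
    # Regions with long dwell time and high zoom are candidate implicit positives.
    return dwell, max_zoom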

References

  • S. Amershi, J. Fogarty, A. Kapoor, and D. Tan (2009) Overview based example selection in end user interactive concept learning. In Proceedings of the 22nd annual ACM symposium on User interface software and technology, pp. 247–256. Cited by: §2.1.
  • S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, et al. (2019) Guidelines for human-ai interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 3. Cited by: §2.1, §7.
  • S. K. Apple (2016) Sentinel lymph node in breast cancer: review article from a pathologist’s point of view. Journal of pathology and translational medicine 50 (2), pp. 83. Cited by: §3.1.
  • P. Bankhead, M. B. Loughrey, J. A. Fernández, Y. Dombrowski, D. G. McArt, P. D. Dunne, S. McQuaid, R. T. Gray, L. J. Murray, H. G. Coleman, et al. (2017) QuPath: open source software for digital pathology image analysis. Scientific reports 7 (1), pp. 16878. Cited by: §2.3.
  • M. S. Blois (1980) Clinical judgment and computers. New England Journal of Medicine 303 (4), pp. 192–197. Note: PMID: 7383090 External Links: Document, Link, https://doi.org/10.1056/NEJM198007243030405 Cited by: §4.1.
  • C. J. Cai, E. Reif, N. Hegde, J. Hipp, B. Kim, D. Smilkov, M. Wattenberg, F. Viegas, G. S. Corrado, M. C. Stumpe, and M. Terry (2019) Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, Glasgow, Scotland Uk, pp. 1–14. External Links: ISBN 978-1-4503-5970-2, Link, Document Cited by: §1, §1.
  • G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs (2019) Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine 25 (8), pp. 1301–1309. Cited by: §1, §3.2.
  • R. Cao, A. M. Bajgiran, S. A. Mirak, S. Shakeri, X. Zhong, D. Enzmann, S. Raman, and K. Sung (2019) Joint prostate cancer detection and gleason score prediction in mp-mri via focalnet. IEEE Transactions on Medical Imaging (), pp. 1–1. External Links: Document, ISSN Cited by: §1.
  • A. E. Carpenter, T. R. Jones, M. R. Lamprecht, C. Clarke, I. H. Kang, O. Friman, D. A. Guertin, J. H. Chang, R. A. Lindquist, J. Moffat, et al. (2006) CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology 7 (10), pp. R100. Cited by: §2.3.
  • R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad (2015) Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document, ISBN 9781450336642 Cited by: §1.
  • D. H. Chau, A. Kittur, J. I. Hong, and C. Faloutsos (2011) Apolo: making sense of large network data by combining rich user interaction and machine learning. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 167–176. Cited by: §2.1.
  • X. ‘. Chen, Y. Tao, G. Wang, R. Kang, T. Grossman, S. Coros, and S. E. Hudson (2018) Forte: User-Driven Generative Design. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 496. Cited by: §2.1.
  • T. J. Collins (2007) ImageJ for microscopy. Biotechniques 43 (S1), pp. S25–S30. Cited by: §2.3.
  • T. Drew, K. Evans, M. L. -H. Võ, F. L. Jacobson, and J. M. Wolfe (2013) Informatics in Radiology: What Can You See in a Single Glance and How Might This Guide Visual Search in Medical Images?. RadioGraphics 33 (1), pp. 263–274. External Links: Document, ISSN 0271-5333, Link Cited by: §1.
  • M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: Appendix A.
  • C. Felix, A. Dasgupta, and E. Bertini (2018) The exploratory labeling assistant: mixed-initiative label curation with large document collections. In The 31st Annual ACM Symposium on User Interface Software and Technology, pp. 153–164. Cited by: §2.1.
  • Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang (2018) Diagnose like a radiologist: attention guided convolutional neural network for thorax disease classification. arXiv preprint arXiv:1801.09927. Cited by: §1.
  • D. A. Gutman, M. Khalilia, S. Lee, M. Nalisnik, Z. Mullen, J. Beezley, D. R. Chittajallu, D. Manthey, and L. A. Cooper (2017) The digital slide archive: a software platform for management, integration, and analysis of histology for cancer research. Cancer research 77 (21), pp. e75–e78. Cited by: §2.3.
  • A. Holzinger, B. Malle, P. Kieseberg, P. M. Roth, H. Müller, R. Reihs, and K. Zatloukal (2017) Towards the augmented pathologist: Challenges of explainable-ai in digital pathology. arXiv preprint arXiv:1712.06657. Cited by: §1, §3.2.
  • E. Horvitz (1999) Principles of mixed-initiative user interfaces. In Conference on Human Factors in Computing Systems - Proceedings, External Links: Document, ISBN 0201485591 Cited by: 1st item, §2.1, §4.1, §7.
  • M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §2.2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: Appendix A.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019a) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §4.1.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng (2019b) CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. External Links: 1901.07031, Link Cited by: §1.
  • E. C. Jensen (2013) Quantitative analysis of histological staining and fluorescence using imagej. The Anatomical Record 296 (3), pp. 378–381. Cited by: §2.3.
  • F. Jiang, Y. Jiang, H. Zhi, Y. Dong, H. Li, S. Ma, Y. Wang, Q. Dong, H. Shen, and Y. Wang (2017) Artificial intelligence in healthcare: Past, present and future. Vol. 2, BMJ Publishing Group. External Links: Document, ISSN 20598696, Link Cited by: §1, §2.2.
  • M. Köbel, S. E. Kalloger, P. M. Baker, C. A. Ewanowich, J. Arseneau, V. Zherebitskiy, S. Abdulkarim, S. Leung, M. A. Duggan, D. Fontaine, et al. (2010) Diagnosis of ovarian carcinoma cell type is highly reproducible: a transcanadian study. The American journal of surgical pathology 34 (7), pp. 984–993. Cited by: §1, §3.2, §4.3.
  • J. C. R. Licklider (1960) Man-computer symbiosis. IRE transactions on human factors in electronics (1), pp. 4–11. Cited by: §2.1.
  • G. Litjens, P. Bandi, B. Ehteshami Bejnordi, O. Geessink, M. Balkenhol, P. Bult, A. Halilovic, M. Hermsen, R. van de Loo, R. Vogels, et al. (2018) 1399 h&e-stained sentinel lymph node sections of breast cancer patients: the camelyon dataset. GigaScience 7 (6), pp. giy065. Cited by: §3.1, §3.1, §5, §5, §5.
  • F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: Appendix A.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Appendix A, §4.3.
  • R. Marée, L. Rollus, B. Stévens, R. Hoyoux, G. Louppe, R. Vandaele, J. Begon, P. Kainz, P. Geurts, and L. Wehenkel (2016) Collaborative analysis of multi-gigapixel imaging data using cytomine. Bioinformatics 32 (9), pp. 1395–1401. Cited by: §2.3.
  • A. L. Martel, D. Hosseinzadeh, C. Senaras, Y. Zhou, A. Yazdanpanah, R. Shojaii, E. S. Patterson, A. Madabhushi, and M. N. Gurcan (2017) An image analysis resource for cancer research: piip—pathology image informatics platform for visualization, analysis, and management. Cancer research 77 (21), pp. e83–e86. Cited by: §2.3.
  • R. A. Miller and F. E. Masarie (1990) The demise of the ’Greek Oracle’ model for medical diagnostic systems. External Links: ISSN 00261270 Cited by: §1, §3.2.
  • M. Nalisnik, M. Amgad, S. Lee, S. H. Halani, J. E. V. Vega, D. J. Brat, D. A. Gutman, and L. A. Cooper (2017) Interactive phenotyping of large-scale histology imaging data with histomicsml. Scientific reports 7 (1), pp. 14588. Cited by: §1, §3.2.
  • A. T. Nguyen, A. Kharosekar, S. Krishnan, S. Krishnan, E. Tate, B. C. Wallace, and M. Lease (2018) Believe it or not: designing a human-ai partnership for mixed-initiative fact-checking. In The 31st Annual ACM Symposium on User Interface Software and Technology, pp. 189–199. Cited by: §2.1.
  • C. T. Rueden, J. Schindelin, M. C. Hiner, B. E. DeZonia, A. E. Walter, E. T. Arena, and K. W. Eliceiri (2017) ImageJ2: imagej for the next generation of scientific image data. BMC bioinformatics 18 (1), pp. 529. Cited by: §2.3.
  • J. Saltz, A. Sharma, G. Iyer, E. Bremer, F. Wang, A. Jasniewski, T. DiPrima, J. S. Almeida, Y. Gao, T. Zhao, et al. (2017) A containerized software system for generation, management, and exploration of features from whole slide tissue images. Cancer research 77 (21), pp. e79–e82. Cited by: §2.3.
  • J. Schindelin, I. Arganda-Carreras, E. Frise, V. Kaynig, M. Longair, T. Pietzsch, S. Preibisch, C. Rueden, S. Saalfeld, B. Schmid, et al. (2012) Fiji: an open-source platform for biological-image analysis. Nature methods 9 (7), pp. 676. Cited by: §2.3.
  • C. A. Schneider, W. S. Rasband, and K. W. Eliceiri (2012) NIH image to imagej: 25 years of image analysis. Nature methods 9 (7), pp. 671. Cited by: §2.3.
  • B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §2.2.
  • E. H. Shortliffe (1993) The adolescence of AI in Medicine: Will the field come of age in the ’90s?. Artificial Intelligence In Medicine. External Links: Document, ISSN 09333657 Cited by: §1.
  • P. Y. Simard, S. Amershi, D. M. Chickering, A. E. Pelton, S. Ghorashi, C. Meek, G. Ramos, J. Suh, J. Verwey, M. Wang, et al. (2017) Machine teaching: a new paradigm for building machine learning systems. arXiv preprint arXiv:1707.06742. Cited by: 2nd item.
  • C. Sommer, C. Straehle, U. Koethe, and F. A. Hamprecht (2011) Ilastik: interactive learning and segmentation toolkit. In 2011 IEEE international symposium on biomedical imaging: From nano to macro, pp. 230–233. Cited by: §2.2.
  • J. Suh, X. Zhu, and S. Amershi (2016) The label complexity of mixed-initiative classifier training. In International Conference on Machine Learning, pp. 2800–2809. Cited by: §2.1.
  • D. Wang, Q. Yang, A. Abdul, and B. Y. Lim (2019) Designing Theory-Driven User-Centric Explainable AI. External Links: Document, ISBN 9781450359702, Link Cited by: §1, §7.
  • Y. Xie, M. Chen, D. Kao, G. Gao, and X. ‘. Chen (2020) CheXplain: Enabling Physicians to Explore and Understand Data-Driven, AI-Enabled Medical Imaging Analysis. In To appear at the 2020 CHI Conference on Human Factors in Computing Systems, Cited by: §1, §2.3, §7.
  • Y. Xie, X. ‘. Chen, and G. Gao (2019) Outlining the design space of explainable intelligent systems for medical diagnosis. In ACM IUI, External Links: Link Cited by: §2.3.
  • Y. Xu, J. Zhu, E. Chang, and Z. Tu (2012) Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 964–971. Cited by: §2.2.
  • Q. Yang, A. Steinfeld, and J. Zimmerman (2019) Unremarkable AI: Fitting Intelligent Decision Support into Critical, Clinical Decision-Making Processes. Conference on Human Factors in Computing Systems - Proceedings. External Links: Document, 1904.09612, ISBN 9781450359702, Link Cited by: §1, §2.3.
  • Q. Yang, J. Zimmerman, A. Steinfeld, L. Carey, and J. F. Antaki (2016) Investigating the Heart Pump Implant Decision Process. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems - CHI ’16, New York, New York, USA, pp. 4477–4488. External Links: Document, ISBN 9781450333627, ISSN 1073-0516, Link Cited by: §1, §2.3.
  • Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma (2014) Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 83–92. Cited by: §6.1.1.
  • Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang (2017) MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network. External Links: 1707.02485, Link Cited by: §1.
  • Z. Zhou (2004) Multi-instance learning: a survey. Department of Computer Science & Technology, Nanjing University, Tech. Rep. Cited by: §4.3.
  • Y. Zhu, S. Zhang, W. Liu, and D. N. Metaxas (2014) Scalable histopathological image analysis via active learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 369–376. Cited by: §2.2.

Appendix A Implementation Details of Impetus Software

In this section, we introduce the implementation of Impetus. Overall, the software was written in Python; the implementation of each of Impetus' capabilities is detailed as follows.

Attention map & recommendation boxes. When the system is first loaded, Otsu's method is utilized to separate foreground tissues from the background. Then an InceptionResNetv2 model (Ioffe and Szegedy, 2015) pre-trained on the PatchCamelyon dataset (https://patchcamelyon.grand-challenge.org/) extracts patch features from the WSI. Each patch is of a fixed size, and the extracted features have a dimension of 1536. In the first iteration, the system performs isolation forest (Liu et al., 2008) outlier detection based on the extracted features, and the detected outliers are highlighted in the attention map. In the following iterations, the attention map is a combination of the outliers (from the initial detection) and high-uncertainty patches (from the model trained in each iteration). Uncertainty is derived from the MIL-RF classifier's output probability for each patch, and the attention map is calculated as the soft-OR of the outlier and uncertainty maps. To obtain the recommendation boxes, the system applies the DBSCAN clustering algorithm (Ester et al., 1996) to the attention map, and in each iteration the recommendation boxes are selected as the two clusters that occupy the largest areas on the WSI.
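The following is a minimal sketch of this pipeline using scikit-learn, assuming patch features and per-patch uncertainties have already been computed; the thresholds, DBSCAN parameters, and bounding-box construction are illustrative assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN

def attention_and_boxes(features, coords, uncertainty, top_k=2):
    """features: (n_patches, 1536) array; coords: (n_patches, 2) patch grid positions;
    uncertainty: (n_patches,) per-patch uncertainty in [0, 1]."""
    # Outlier score in {0, 1}: isolation forest flags unusual-looking patches.
    outlier = (IsolationForest(random_state=0).fit_predict(features) == -1).astype(float)

    # Soft-OR combination of the outlier and uncertainty maps.
    attention = outlier + uncertainty - outlier * uncertainty

    # Cluster high-attention patches spatially and keep the largest clusters
    # as recommendation boxes (here represented by their bounding boxes).
    hot = coords[attention > 0.5]
    labels = DBSCAN(eps=2, min_samples=5).fit_predict(hot)
    boxes = []
    for lbl in sorted(set(labels) - {-1}, key=lambda l: -(labels == l).sum())[:top_k]:
        pts = hot[labels == lbl]
        boxes.append((pts.min(axis=0), pts.max(axis=0)))
    return attention, boxes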

  initialize positive set X_p = {}, negative set X_n = {}
  while performance not satisfied do
     User annotates boxes B_i with box-level labels y_i
     for each annotated box B_i do
        if y_i is negative then
           Add all instances in B_i to X_n
        else
           Embed the instance features X_i into a 2-D embedding E_i with T-SNE
           Split E_i into E_i1, E_i2 with K-Means; map the split back to the original data X_i1, X_i2
           Assign each instance in X_i1 the label 1, and each instance in X_i2 the label 0
           Train a random forest classifier with X_i1 and X_i2
           Predict the instances of a negative box with this classifier
           Adjust the labels of X_i1 and X_i2 (the cluster predicted opposite to the negative box is positive)
           Append the positive instances to X_p
        end if
     end for
     Train a random forest classifier (MIL-RF) with X_p and X_n
  end while
Algorithm 1 Impetus MIL Training

Agile labeling. Impetus uses an agile labeling technique that does not require users to provide the exact contour of tumor tissues in a WSI, as strongly-supervised learning does. Instead, a user can mark a positive box over an area that contains at least one tumor patch, or a negative box over an all-negative region. We implemented weakly-supervised multiple instance learning (MIL) to enable a traditional random forest algorithm to learn over such agile labels. As shown in Algorithm 1, X_i is the feature set of the patches within a box B_i, and each box only carries one box-level label y_i. For a negative box, all the instances in the box can be included in the negative set X_n. To distinguish the truly positive patches in a positive box, we first use manifold learning – T-SNE (Maaten and Hinton, 2008) – to represent the 1536-dimension features with a two-dimensional embedding E_i. Here, it is assumed that T-SNE embeds the positive-box instances into one positive and one negative cluster, so we use K-Means clustering to split E_i into two clusters, E_i1 and E_i2. Using this split, the original 1536-dimension features X_i can be partitioned into X_i1 and X_i2. After clustering, the algorithm compares the two clusters with negative samples from a negative box to pick the truly positive cluster. To achieve this, it first assigns each instance in X_i1 the label 1 and each instance in X_i2 the label 0, then trains a random forest classifier with X_i1 and X_i2. The trained classifier is used to predict the negative-box instances previously provided by the user; the cluster whose label is opposite to the prediction on the negative box is taken as the positive cluster. Once the positive cluster is recognized, all of its instances are added to X_p, and the instances in the remaining cluster are discarded. Finally, a random forest classifier (MIL-RF) is trained with the obtained X_p and X_n, and the user can iteratively provide more annotations until the trained classifier reaches a satisfactory level of performance.
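A minimal sketch of this label-disambiguation step using scikit-learn is shown below; the function and variable names are illustrative, and the majority-vote comparison against the negative box is our interpretation of the 'opposite prediction' rule.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def split_positive_box(X_pos_box, X_neg_box):
    """Given features of a positive box and a negative box, return the instances
    of the positive box that are inferred to be truly positive."""
    # Embed the 1536-d features into 2-D and split them into two clusters.
    embedding = TSNE(n_components=2,
                     perplexity=min(30, len(X_pos_box) - 1),
                     random_state=0).fit_transform(X_pos_box)
    cluster = KMeans(n_clusters=2, random_state=0).fit_predict(embedding)

    # Train a temporary classifier that separates the two clusters (labels 1 vs. 0).
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_pos_box, (cluster == 0).astype(int))

    # If the negative-box instances are mostly predicted as label 1 (i.e., they
    # resemble cluster 0), then cluster 1 is the truly positive cluster, and vice versa.
    neg_looks_like_cluster0 = rf.predict(X_neg_box).mean() > 0.5
    positive_cluster = 1 if neg_looks_like_cluster0 else 0
    return X_pos_box[cluster == positive_cluster]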

  get per-patch predictions of the WSI; count the positive and uncertain patches
  if #positive > 200 and #positive > 2 * #uncertain then
     Predict high-confidence
  else if number of outlier clusters == 0 then
     Predict low-confidence
  else if #uncertain > 300 then
     Predict low-confidence
  else if #positive > 200 then
     Predict high-confidence
  else
     Predict mid-confidence
  end if
Algorithm 2 Confidence-level Classifier

Confidence calculation. Impetus has a rule-based confidence-level classifier that classifies slides into three categories: high-confidence, mid-confidence, and low-confidence. As shown in Algorithm 2, the system first obtains the predictions of all the patches in the WSI. A patch has two characteristics: is_positive and is_uncertain. A patch is positive if the MIL-RF classifier output is greater than 0.5, and is uncertain if the MIL-RF classifier output falls within a predefined range around the decision boundary. We empirically summarize the confidence-level decision rules as follows (a minimal code sketch of these rules appears after the list):

  • if there are more than 200 positive patches AND the number of positive patches is greater than twice the number of uncertain patches, then the slide is predicted as high-confidence;

  • if there are no outlier clusters, then the slide is predicted as low-confidence;

  • if the number of uncertain patches is greater than 300, then the slide is predicted as low-confidence;

  • if the number of positive patches is greater than 200, then the slide is predicted as high-confidence;

  • for all other cases, the slide is predicted as mid-confidence.
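The rules above can be expressed compactly as follows; this is a sketch with illustrative function and variable names, mirroring Algorithm 2.

def confidence_level(n_positive, n_uncertain, n_outlier_clusters):
    """Rule-based confidence classifier over per-slide patch counts."""
    if n_positive > 200 and n_positive > 2 * n_uncertain:
        return "high-confidence"
    if n_outlier_clusters == 0:
        return "low-confidence"
    if n_uncertain > 300:
        return "low-confidence"
    if n_positive > 200:
        return "high-confidence"
    return "mid-confidence"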