CrossPath: Top-down, Cross Data Type, Multi-Criterion Histological Analysis by Shepherding Mixed AI Models

06/23/2020 ∙ by Hongyan Gu, et al.

Data-driven AI promises support for pathologists to discover sparse tumor patterns in high-resolution histological images. However, three limitations prevent AI from being adopted into clinical practice: (i) a lack of comprehensiveness, where most AI algorithms rely on a single criterion/examination; (ii) a lack of explainability, where AI models work as 'black boxes' with little transparency; (iii) a lack of integrability, where it is unclear how AI can become part of pathologists' existing workflow. To address these limitations, we propose CrossPath: a brain tumor grading tool that supports top-down, cross data type, multi-criterion histological analysis, where pathologists can shepherd mixed AI models. CrossPath first uses AI to discover multiple histological criteria on H&E and Ki-67 slides based on WHO guidelines. Second, CrossPath demonstrates AI findings with multi-level explainable supportive evidence. Finally, CrossPath provides a top-down shepherding workflow to help pathologists derive an evidence-based, precise grading result. To validate CrossPath, we conducted a user study with pathologists in a local medical center. The results show that CrossPath achieves a high level of comprehensiveness, explainability, and integrability while reducing time consumption by about one third compared to using a traditional optical microscope.


1 Introduction

One critical step in cancer diagnosis and treatment is pathologists’ analysis of histological images obtained from a patient’s tissue sections to identify evidence of tumor cells and determine the grade (benign vs. malignant) based on medical guidelines.

Such an analysis is often challenging for pathologists, due to the sheer amount of effort it requires to identify sparse patterns of tumor cells given the ultra-high resolution of multiple histological images in a single patient’s case. Further, the process suffers from subjectivity due to the intra- and inter-observer variations, different interpretations of the grading guidelines, and different ways of sampling and examining the slides to ‘implement’ a given guideline.

To overcome these challenges, digital histology, enabled by advanced scanning techniques, promises to transform traditional microscopic analysis to digital visualization automatable by data-driven artificial intelligence (AI) [19]. However, there remain three limitations of existing AI-aided histological analysis that prevent its adoption in clinical practice:

  • A lack of comprehensiveness: most AI models tend to focus on one specific criterion inferred from one specific type of histological data [6, 13, 20, 25, 29, 33, 36, 48], while in practice pathologists almost always rely on multiple criteria examined across multiple types of data;

  • A lack of explainability: most AI models function as ‘black boxes’ and provide little justification for the generated analysis—such a lack of transparency creates a barrier that prevents AI from being accepted by pathologists;

  • A lack of integrability: most AI models abstract histological analysis as a computational problem, so it remains unclear how to integrate such models into pathologists’ existing workflow and practices.

Figure 1: User interface of CrossPath: (a) WSI viewer with continuous magnification; the yellow box corresponds to the area of one high power field in the optical microscope, and the blue box corresponds to the selected evidence in the evidence list; (b) a user can verify each piece of sampled evidence by clicking on the ‘approve’, ‘decline’, or ‘declare uncertain’ button; (c) a heatmap can be enlarged to show a global distribution of each criterion; (d) a list of sampled evidence generated by AI; the user can click on each piece of evidence to register the patch into the WSI on the left; (e) auto-generated suggested grading according to the WHO guideline, where each criterion is examined by an AI model; (f) both the suggested grading and the findings on individual criteria are updated as the user overrides the AI’s results; (g) an arrow highlights the main contributing criterion to the current grading result; (h) the user can also shepherd the mixed AI models by overriding each histological feature manually.

To address these limitations, we develop CrossPath—a brain tumor grading tool for pathologists to perform top-down, cross data type, multi-criterion histological analysis by shepherding mixed AI models. Currently, we focus on the grading of meningioma—the most common primary type of brain tumor—as a point of departure for exploring the design of CrossPath. The goal is to aid pathologists in grading meningioma with AI, assuming that tumor areas have already been identified on a slide.

Figure 1 shows an overview of CrossPath’s interface. A typical workflow follows a top-down path, starting with a pathologist first seeing the top-level suggested grading (Figure 1e), where an arrow (Figure 1g) highlights the main contributing criterion that leads to the suggested grading. The analysis of each criterion is produced by an AI model that examines a specific type of histological data (H&E, Ki-67) based on the World Health Organization (WHO) guideline. The pathologist can further select and drill down to a specific criterion, which retrieves a set of evidence (Figure 1d) to explain why AI arrives at its result. For example, for the criterion of mitotic count (Figure 1f), the pathologist can see patches of evidence (Figure 1d) sampled in the largest aggregation of mitosis. Moreover, the pathologist can open a heatmap (Figure 1c) overlaying the whole slide image (WSI) to overview the density distribution of positive mitotic cells recognized by the AI. Selecting a patch directs the pathologist’s attention to a high power field (HPF, Figure 1a, yellow box) of the mitosis on the original WSI, where they can further examine the low-level histological features and approve or decline AI’s analysis with one click (Figure 1b), which in turn updates AI’s results on the individual criterion and, if necessary, the overall suggested grading based on WHO guidelines.

To validate CrossPath, a technical evaluation shows that the AI models achieve a validation area under the curve (AUC) of 0.842, 0.989, 0.923, and 0.834/0.673 in identifying mitotic nuclei, necrosis, prominent nucleoli, and meningothelial/fibrous tissues, respectively. Moreover, CrossPath achieves an average error rate of and in nuclei counting (hypercellularity) and Ki-67 index calculation. A user study with seven pathologists shows that, with less than an hour of learning, pathologists were able to use CrossPath to make accurate grading decisions, with about a one-third reduction in time consumption compared to examining with the optical microscope. Pathologists’ qualitative feedback indicates that CrossPath attains a high level of comprehensiveness, explainability, and integrability; meanwhile, the main challenge is that the current system does not let users adjust the prediction threshold via the interface, which results in a number of false-positive pieces of evidence that might influence the suggested result. Another challenge is that participants were not used to digital WSI examination, since the apparent size of cells varies across computer monitors because of the extra digital zoom in the interface.

1.1 Contributions

This paper makes a tool contribution to AI-aided pathology, addressing the three key issues of comprehensiveness, explainability, and integrability.

  • CrossPath addresses comprehensiveness by incorporating multiple WHO-guided criteria across different data types;

  • CrossPath addresses explainability at three levels: at the top level, presenting the grading logic based on the WHO guideline; at the mid level, providing specific examples with features leading to AI’s result for each criterion; at the low level, registering evidence into the WSI to reveal more contextual features;

  • CrossPath addresses integrability into existing workflow by mimicking how an attending pathologist oversees trainees’ work, allowing a pathologist to shepherd mixed AI models following a top-down path to examine individual AI’s results with evidence on-demand guided by heatmap visualizations.

2 Related Work

CrossPath focuses on an interactive Computer-Aided Diagnosis system, which helps pathologists discover multiple histological characteristics of high-grade meningioma, with supporting evidence and visualization of AI-generated results. Correspondingly, the related work can be categorized into three areas: interactive tools for digital histology; AI frameworks for digital histology; human-AI collaborative platforms.

2.1 Interactive Tools for Digital Histology

Digital histology often deals with high-resolution whole slide images. Beyond AI models trained by domain experts, a variety of tools help pathologists define, explore, and make decisions for clinical or research use.

When it comes to digital histology, many tools are designed primarily for studying a localized area, as opposed to using multiple data types to examine a patient’s biopsy section at the whole-slide level. ImageJ [39] is one of the most commonly used tools for scientific image analysis and has been extended by computational medicine research. The platform provides essential image processing tools that allow users to perform simple tasks, such as nuclei segmentation. Over its 30 years of evolution, various distributions [11, 35, 38] and plugins [22] have been integrated into the system, making it the most widely used software in digital histology.

Besides ImageJ, others have focused on enhancing WSI accessibility for non-computational users. For example, CellProfiler [9] helps non-computational users perform complex quantitative morphological assays automatically without writing any code. caMicroscope [37] enables a user to run a segmentation pipeline in a selected area and to analyze nuclei characteristics in whole-slide images. QuPath [5] provides extensive annotation and automated analysis tools for pathologists, such as nuclei segmentation and positive cell counting. Pathology Image Informatics Platform (PIIP) [28] extends the capabilities of the Sedeen viewer (https://pathcore.com/sedeen/) by adding plugins for out-of-focus detection, region-of-interest transformation, and immunohistochemical (IHC) slide analysis.

Although some systems mentioned above provide a certain level of automation to reduce pathologists’ workload, a majority of them do not support examining WSI comprehensively at a full-slide level. As a result, such tools are hard to integrate into pathologists’ workflow.

2.2 AI Frameworks for Digital Histology

Given the abundance of histological images, data-driven AI can potentially be integrated into an automatic analysis pipeline to save medical professionals’ time and effort. Given the high resolution and large volume of digital histology data, mainstream frameworks incorporate AI algorithms for Computer-Aided Diagnosis, content-based image retrieval (CBIR), and discovering relationships among various features [23]. Amongst these three primary applications, using AI to identify regions-of-interest for Computer-Aided Diagnosis, such as identifying mitosis, is most related to CrossPath.

Current pathology AI frameworks are mostly based on H&E slides and focus on detecting individual histological features. Irshad includes selected color spaces and morphological features in a mitotic cell detection pipeline to support breast cancer grading [21]. Lu et al. use Bayesian modeling and a local-region threshold method to detect breast cancer mitotic cells [27]. Mishra et al. propose a CNN to distinguish viable vs. necrotic osteosarcoma tumor tissue [30]. Sharma et al. introduce a CNN for both cancer classification from immunohistochemical (IHC) response and necrosis region detection in gastric carcinoma [41]. Veta et al. propose a nuclei segmentation pipeline with pre-processing, watershed segmentation, and post-processing steps for breast cancer nuclei segmentation [45]. Zhou et al. adapt a deeply supervised U-Net model with skipped pathways to perform nuclei segmentation and improve performance over the traditional U-Net [50]. Yap et al. use RankBoost-permutations to integrate multiple base classifiers to detect prominent nucleoli patterns in prostate cancer, breast cancer, and renal cell papillary cancer [49].

Several AI frameworks have also been proposed based solely on Ki-67 immunohistochemical (IHC) tests. For example, Saha et al. use a CNN as a feature extractor with a Gamma Mixture Model to detect immuno-positive and negative cells for breast cancer [36]. Xing et al. train a Fully Connected Convolutional Network that can perform cell segmentation and identification in a single stage [48]. Anari et al. utilize fuzzy c-means clustering to extract positive and negative cells in Ki-67 slides for meningioma tissues [3].

Although these histological image analysis frameworks report performance on par with humans, one main problem is that they mostly focus on a single criterion rather than integrating multiple criteria at the same time. Hence, the promise of using AI to reduce workload is obscured when doctors expect a diagnosis based on a comprehensive list of criteria. Another limitation is that prior work mainly takes a single data type, such as H&E images, as input. In contrast, doctors in clinical practice usually refer to multiple IHC examinations to perform a differential diagnosis [15, 42]. Further, the non-transparent, non-explainable nature of end-to-end AI algorithms limits their application in high-stakes medical decision processes, such as tumor grading.

2.3 Human-AI Collaborative Platforms

Although AI promises to free humans from tedious tasks, applying it to high-stakes applications, such as medical decision processes, is challenging. Thus, it is important to enable human-AI collaboration in various domains, including medicine.

In a broader context, recent research in HCI has demonstrated a plethora of examples of human-AI collaboration. For example, Forte enables a user to modify an optimization result, which serves as input for the next iteration to reflect the user’s intent [10]. Willett et al. propose a mixed-initiative tool that enables novice users to convert a still picture to an animation by providing a few user-drawn scribbles as input, saving users’ time and effort [47]. Lee et al. implement a simulation platform that accepts high-level design commands and generates corresponding virtual mannequins in real time [24].

Research has also shown that human-AI collaboration can enhance tasks in the medical domain. For example, Cai et al. propose a CBIR system where doctors can query similar histological images with a user-adjusted refine-by-concept tool to support differential diagnosis with digital histology [7]. Further, integrating human domain knowledge into the AI pipeline enables humans to ‘supervise’ and enhance AI via interactive machine learning. Apart from saving human effort, this can also help eliminate AI’s erroneous predictions in high-stakes medical applications. For example, Ilastik [43] enables the user to draw strokes over images for training segmentation models; the system can automatically recommend the most critical features to reduce overfitting. HistomicsML is an active learning [40] system that dynamically queries the most uncertain patches from a random forest classifier, thus allowing pathologists to refine the classification model iteratively with fewer samples.

To sum up, the systems mentioned above succeed in reducing human workload through computer automation. However, little prior work has explored the collaborative relationship between pathologists and AI or how to integrate AI into pathologists’ existing workflow. To fill this gap, our research investigates how a pathologist can cooperate with AI in a multi-criterion medical application by enabling the physician to ‘shepherd’ AI in a top-down, evidence-based manner.

3 Background of Meningioma

We focus on the meningioma grading task to probe the design of medical human-AI collaborative systems. According to WHO guidelines, meningioma can be graded as I (benign), II (atypical), or III (malignant). Grade I meningiomas are recognized by their histological subtype and a lack of anaplastic features. Grade II meningiomas are defined by one of the following four criteria: 4 to 19 mitoses (Figure 2a) in 10 high power fields (HPFs); three out of five histological features—hypercellularity (Figure 2b), prominent nucleoli (Figure 2c), sheeting (Figure 2d, which only exists in the meningothelial subtype), necrosis (Figure 2f), small cell (Figure 2g); brain invasion (Figure 2h); or a clear cell or chordoid histological subtype. Grade III meningiomas have 20 or more mitoses per 10 HPFs or have the histological appearance of carcinomas, melanomas, or high-grade sarcomas [4, 26].
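Because these criteria are rule-based, a suggested grading can be derived mechanically once each criterion has a value. Below is a minimal sketch of such a rule mapping, assuming the per-criterion findings (AI-suggested or pathologist-overridden) are already available; the function and field names are hypothetical illustrations, not CrossPath's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CriterionFindings:
    """Hypothetical container for per-criterion results (AI-suggested or overridden)."""
    mitoses_per_10hpf: int = 0
    histological_features: int = 0   # count of: hypercellularity, prominent nucleoli, sheeting, necrosis, small cell
    brain_invasion: bool = False
    aggressive_subtype: bool = False      # clear cell or chordoid subtype
    anaplastic_appearance: bool = False   # carcinoma-, melanoma-, or sarcoma-like appearance

def suggest_who_grade(f: CriterionFindings) -> int:
    """Map findings to a suggested WHO grade following the criteria summarized above."""
    if f.mitoses_per_10hpf >= 20 or f.anaplastic_appearance:
        return 3  # malignant
    if (4 <= f.mitoses_per_10hpf <= 19 or f.histological_features >= 3
            or f.brain_invasion or f.aggressive_subtype):
        return 2  # atypical
    return 1      # benign

# Example: 6 mitoses per 10 HPFs and no other findings -> suggested grade II
print(suggest_who_grade(CriterionFindings(mitoses_per_10hpf=6)))
```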

Figure 2: Histological features of meningiomas that CrossPath is able to process: (a) mitotic cell (marked in the red box); (b) hypercellularity (abnormal excess of cells, marked in the red box); (c) prominent nucleoli (enlarged nucleoli, marked in the red box); (d) meningothelial subtype (sheeting can only be present in this subtype); (e) fibrous subtype; (f) necrosis (irreversible injury to cells, marked in the red box); (g) suggestive of small cells (tumor cells with a high nuclear/cytoplasmic ratio); (h) suggestive of brain invasion (invasive tumor cells in brain tissue); (i) Ki-67 proliferation index (positive cells appear brown by chromogenic immunohistochemistry).

Importantly, the clinical treatment of grade I and grade II/III meningiomas differs: grade I lesions can be treated with either surgery or external beam radiation, while grade II/III lesions often need both [46]. Further, a study shows that grade II/III patients experience a worse prognosis even after critical treatments [32].

Given the treatment and prognosis differences across tumor grades, it is crucial to support meningioma grading. The first and foremost way to examine meningiomas is to examine tissue slides with H&E staining. In addition, Ki-67 staining (Figure 2i) is often used as an additional reference, since the Ki-67 proliferation index is related to mitosis and is reported to be positively correlated with meningioma grade [2]. Such multiple criteria help pathologists grade meningioma comprehensively; an erroneous grading comes at a high cost: an overestimated case incurs unnecessary treatment, while an overlooked case delays necessary treatment.

4 Formative Study

Tizhoosh et al. summarized ten challenges in AI-aided digital histology at a high level [44]. The paper notes that diagnostic tasks are ‘non-boolean’ and a diagnosis cannot be over-simplified into a ‘yes’ or ‘no’, as machine learning tasks usually are. Tizhoosh et al. also point out that current uni-tasked, weak AI cannot fully achieve complex, multi-criterion tasks. Finally, a lack of transparency and interpretability makes data-driven black-box algorithms hard to understand and hence results in untrustworthy diagnoses. To further understand where and how those pitfalls might occur in AI-aided pathology, we conducted a formative study with four pathologists in a local medical center.

We started by describing the motivation and mission of the project. We then asked pathologists to describe their typical process of examining a patient’s case. Next, we asked what the major challenges were in existing pathology practice, and further inquired about their expectations of AI-aided automatic diagnostic systems.

4.1 Existing Challenges for Pathologists

We found that there are two major challenges in current pathology practice in the grading of meningioma.

Time consumption For the grading of meningioma, the sparsity of relevant features and the narrow view at high power magnification make examining the slides time-consuming for pathologists. A biopsy from a patient’s brain tissue typically generates eight to twelve H&E slides. Pathologists need to look through all of those slides and integrate the information found on each slide. Specifically, pathologists need to follow the sampling guideline and search for histological features in the high power field. While the grading of meningioma is a relatively easy task for experienced pathologists, for the majority of others, the grading is extremely time-consuming and tedious, taking up to one or several hours to go through a single patient’s case.

Subjectivity We found three factors that contribute to subjectivity. First, subjectivity due to a lack of precise definitions: the WHO guidelines do not always provide a quantified description for the five histological features of high-grade meningioma. For example, the ‘prominent nucleoli’ criterion requires pathologists to report prominent nucleolus clusters, but the WHO guideline does not specify how large a nucleolus should be to be recognized as ‘prominent’; pathologists therefore often rely on their own experience to decide. Second, subjectivity in implementing the examination process: for example, the mitotic count for grade II meningioma is defined as >4 mitotic cells in 10 HPFs, but the guideline does not specify the sampling rules for finding those 10 HPFs, so different pathologists are likely to sample different areas on the slide. Third, subjectivity due to personal factors, such as level of experience, time constraints, and fatigue [12]. The subjectivity in the current pathology workflow results in inter-observer variance in diagnosis, which can be fatal given the high stakes in medical diagnosis.

4.2 System Requirements for AI-Aided Examination

Based on the challenges mentioned above, we identify the following system requirements based on further discussions with pathologists.

Comprehensiveness The time consumption in current practice is in part due to the need to examine multiple criteria, especially those used to justify high-grade meningioma: hypercellularity, necrosis, small cell, prominent nucleoli, and sheeting. To reduce the workload and time spent on examination, the system should provide support for all these criteria based on multiple types and sources of data.

Explainability One important way to overcome subjectivity is to make the grading process transparent. The system should show a visual summarization of AI’s automated process and recommend highly suspicious regions-of-interest to pathologists for the non-quantified features in the WHO guidelines. Another important approach is to provide evidence-based justification—evidence for each criterion found by AI that can be cross-checked by pathologists.

Integrability Given AI’s existing limitations, the system should enable pathologists to collaborate with AI to reach a grading diagnosis. Specifically, one suggested form of integration is to ‘shepherd’ AI at a high level—examining human-readable intermediate results, then validating, cross-checking with peers, or overriding AI’s findings. Such a shepherding relationship is similar to how attending pathologists currently delegate work to and oversee their trainees.

5 Design & Implementation

Corresponding to the meningioma grading criteria suggested by WHO [26], we first worked with pathologists to construct five meningioma datasets for training. We then trained mixed AI models (detailed below) to examine a total of eight criteria. We developed a front-end interface that integrates the evidence found by AI into the H&E and Ki-67 slides, which we describe in this section.

5.1 Comprehensive Examinations of Multiple WHO Criteria

To achieve comprehensiveness, we followed the WHO meningioma grading guideline and automated eight criteria for meningioma grading: mitotic count, Ki-67 index, hypercellularity, necrosis, small cell, prominent nucleoli, sheeting, and brain invasion. We constructed five datasets with pathologists for model training. Specifically, we constructed a mitotic nuclei dataset for mitotic cell detection, a nuclei segmentation dataset for hypercellularity detection, a necrosis dataset for necrosis detection, a prominent nucleoli dataset for prominent nucleoli detection, and a meningioma subtype dataset (meningothelial, fibrous, other) for sheeting pattern recognition.

To train the models, we first split the five datasets into training and validation subsets. Then, we trained a mitotic nuclei identification model, a nuclei segmentation model (for hypercellularity), a necrosis identification model, a prominent nucleoli identification model, and a sheeting identification model on the corresponding training sets. For the Ki-67 index, we used the positive cell count function from [16] to detect both Ki-67 positive and negative nuclei. Due to a lack of annotated data, we selected the top-10 -pixel patches with the highest nuclei count within each slide as small cell recommendations. Finally, we used a rule-based method to differentiate tumor and non-tumor areas for the brain invasion criterion. Please refer to Appendix A for implementation details.
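For reference, the Ki-67 proliferation index is simply the proportion of positively stained nuclei among all counted nuclei. The sketch below illustrates this calculation from per-patch counts; it is a simplified illustration with our own (hypothetical) names, not the counting function from [16].

```python
def ki67_index(positive_counts, negative_counts):
    """Compute the Ki-67 proliferation index (% positive nuclei) from per-patch counts.
    positive_counts / negative_counts: lists of Ki-67 positive / negative nuclei per patch."""
    pos = sum(positive_counts)
    neg = sum(negative_counts)
    total = pos + neg
    if total == 0:
        return 0.0  # no nuclei detected in the counted patches
    return 100.0 * pos / total

# Example: 120 positive and 2,880 negative nuclei over a slide -> 4.0%
print(ki67_index([120], [2880]))
```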

To better organize AI results on the eight criteria, we worked with pathologists to obtain their examination priority. Given the current limitations of AI, CrossPath splits the eight criteria into two categories based on WHO guidelines: deterministic and suggestive. For the highly prioritized criteria with quantified values—mitotic count and Ki-67 index—CrossPath shows deterministic values directly (Figure 3b). For the rest of the criteria (which are reported as present/not present), CrossPath provides recommendations to pathologists. Specifically, CrossPath calculates regions-of-interest (ROIs) that present the probable local regions for four histological criteria (hypercellularity, necrosis, small cell, prominent nucleoli). Each criterion follows different ROI sampling rules, and the area of each ROI may vary from one to another. For example, a hypercellularity ROI is defined as a cluster with more than 2,000 nuclei per HPF (Figure 4e). Please refer to Appendix B for details on the other criteria’s sampling rules. Figure 4 (e, f, g, h) highlights the ROI samples. With the aid of ROI samples, pathologists can prioritize their examination on an area of interest with respect to a specific criterion without exhaustively scanning the WSI looking for a starting point.
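As one concrete example of an ROI sampling rule, the sketch below scans an HPF-sized window over detected nucleus centroids and keeps windows whose nucleus count exceeds the 2,000-per-HPF threshold mentioned above. The window size, stride, and the use of a simple sliding window (rather than CrossPath's actual clustering) are illustrative assumptions.

```python
import numpy as np

def hypercellularity_rois(nucleus_xy, slide_w, slide_h,
                          hpf_size=2000, stride=1000, threshold=2000):
    """Return HPF-sized windows whose nucleus count exceeds `threshold`.
    nucleus_xy: (N, 2) array of detected nucleus centroids in pixel coordinates.
    hpf_size/stride are in pixels and depend on scanner magnification (assumed values)."""
    xy = np.asarray(nucleus_xy, dtype=float)
    if xy.size == 0:
        return []
    rois = []
    for y in range(0, max(1, slide_h - hpf_size + 1), stride):
        for x in range(0, max(1, slide_w - hpf_size + 1), stride):
            in_win = ((xy[:, 0] >= x) & (xy[:, 0] < x + hpf_size) &
                      (xy[:, 1] >= y) & (xy[:, 1] < y + hpf_size))
            count = int(in_win.sum())
            if count > threshold:
                rois.append((x, y, count))
    # densest windows first, so pathologists can prioritize their review
    return sorted(rois, key=lambda r: -r[2])
```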

5.2 Explanation by Example Using Sampled Evidence

CrossPath provides top-down, three-level explanations that illustrate a high-level line of reasoning without losing connections to localized areas.

5.2.1 High-Level Logic

At the high level, CrossPath provides two visual cues to explain how the system follows the WHO guidelines. First, CrossPath displays an arrow that highlights the main contributing criterion. As shown in Figure 3b, the arrow indicates that the current suggested ‘WHO grade II’ result (Figure 3a) is generated based on the mitotic count criterion. Second, CrossPath uses different color bars to distinguish the aforementioned deterministic vs. suggestive criteria. Specifically, deterministic criteria have red or green color bars, where red is for findings that indicate a higher grade (Figure 3c); suggestive criteria display orange bars. For the orange suggestive criteria, pathologists can drill down to examine mid-level samples. If a pathologist sufficiently overrides AI’s results, a suggestive criterion is changed to deterministic and its color bar is updated correspondingly. A gray bar indicates that the corresponding criterion is not available in this case.

5.2.2 Mid-Level Explanation-by-Example Sampling

To better help pathologists focus on localized regions and increase intra-observer consistency, CrossPath samples and demonstrates evidence found by AI. For example, for the most important mitosis criterion, CrossPath provides the following two ‘shortcuts’ for pathologists to look into evidence of AI’s results.

Figure 3: High-level logic of CrossPath: (a) the overall suggested grading, generated automatically; (b) a structured overview of each WHO criterion; the arrow highlights the main contributing criterion to the suggested grading. For each criterion, a red bar indicates that the criterion has a deterministic value, while orange ones still need manual examination. A gray bar indicates that the criterion is not available in this case; (c) users can override the orange criteria by right-clicking on each item and changing it to ‘found’, ‘not found’, or ‘uncertain’. The overridden criterion becomes deterministic after manual verification, and the suggested result is updated accordingly.
Figure 4: Selected pieces of sampled evidence from in-the-wild detection where we applied trained models directly on multiple H&E and Ki-67 slides: (a) a highest focal region sampling result of mitotic count on H&E slide (red box, 1HPF), the small blue frames are the -pixel evidence patches (that are shown on the evidence list); (b) a highest focal region sampling result on Ki-67 slide (red box, 1HPF); (c) a highest region sampling result of mitotic count on H&E slide (red box, 10HPF), the small blue frames are the -pixel evidence patches; (d) highest region sampling result on Ki-67 slide (red box, 10HPF); (e) a hypercellularity ROI sample; (f) a necrosis ROI sample; (g) a small cell ROI sample; (h) a prominent nucleoli ROI sample.

Highest Focal Region Sampling From our formative study, we found that high-grade meningiomas share a common feature of increased mitosis in a localized area. Hence, CrossPath provides the highest focal region sampling tool to help pathologists better localize areas of highly concentrated mitosis/Ki-67 index. In CrossPath, the highest focal region is calculated as the single HPF that has the most mitoses (Figure 4a) or the highest Ki-67 index (Figure 4b). With the aid of the highest focal region sampling tool, pathologists can verify AI’s conclusion of a worst-case count of mitotic activity.

Highest Region Sampling One criterion of the WHO meningioma grading guideline is “mitotic count in 10 consecutive HPFs”. In our formative study, we found that the inter-observer consistency of “10 consecutive HPFs” is low. To improve inter-observer consistency, CrossPath provides the highest region sampling tool, where the highest region is defined as an area that contains the most mitotic nuclei (Figure 4c) or the highest Ki-67 index (Figure 4d). The highest region sampling tool speeds up a pathologist’s work by helping them locate an area of interest consisting of 10 consecutive HPFs, where they can then see evidence of AI’s examination on these 10 HPFs.
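A minimal sketch of both sampling ideas, assuming per-HPF mitosis counts have been aggregated into a 2-D grid: the highest focal region is the single densest HPF, and the highest region is the densest window covering 10 HPFs (here assumed to be a 2×5 block). The grid layout and window shape are simplifying assumptions, not CrossPath's exact sampling rule.

```python
import numpy as np

def highest_regions(hpf_counts, window=(2, 5)):
    """hpf_counts: 2-D array of mitosis counts (or Ki-67 indices), one cell per HPF.
    Returns the (row, col) of the single densest HPF (highest focal region),
    the top-left corner of the densest 10-HPF window (highest region), and its sum."""
    counts = np.asarray(hpf_counts)
    focal = np.unravel_index(int(np.argmax(counts)), counts.shape)
    wh, ww = window
    best, best_sum = (0, 0), -1.0
    for r in range(counts.shape[0] - wh + 1):
        for c in range(counts.shape[1] - ww + 1):
            s = counts[r:r + wh, c:c + ww].sum()
            if s > best_sum:
                best, best_sum = (r, c), float(s)
    return focal, best, best_sum

# Example: a 4x6 grid of per-HPF mitosis counts
grid = np.random.randint(0, 5, size=(4, 6))
print(highest_regions(grid))
```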

Figure 5: CrossPath supports registering each piece of mid-level evidence (a) into the WSI (b) to enable pathologists to examine it at high magnification (c). The numbers in the boxes indicate the average prominent nucleoli density.

5.2.3 Low-Level Registration

CrossPath supports registering each piece of mid-level evidence into the WSI to enable pathologists to examine it at even higher magnification. For example, as shown in Figure 5, pathologists can select the AI-generated prominent nucleoli ROI (a), CrossPath registers the corresponding evidence into the original WSI (b), and pathologists can examine the evidence at high magnification for low-level details (c).

Figure 6: A typical workflow in CrossPath. Users first start from the final result (a), then examine the main contributing criteria (b). They can further examine the evidence list (c), and register back into the original WSI in high magnification (d,e). Furthermore, users can approve/decline the evidence (f) and repeat (c-f) until they feel they have gained sufficient confidence for a grading diagnosis for the rest of the criteria (g).
Figure 7: Selected heatmaps from in-the-wild detection where we applied trained models directly on multiple H&E and Ki-67 slides, showing (a) mitotic count density; (b) Ki-67 index values; (c) nuclei density; (d) necrosis probability; (e) prominent nucleoli density; (f) meningothelial vs.  fibrous subtype; (g) tumor vs. non-tumor tissues.

5.3 Shepherding AI to Integrate into How Pathologists Work

Building on the explainable design, CrossPath establishes a shepherding workflow based on how attending physicians oversee trainees’ work.

A top-down shepherding workflow As shown in Figure 6, pathologists’ shepherding workflow with CrossPath is top-down and starts from the top-level suggested result (Figure 6a). Pathologists then examine the main contributing criterion (Figure 6b) and the evidence for that criterion (Figure 6c). Next, pathologists shepherd AI by drilling down and tracing back to the original area on the WSI (Figure 6d,e). With CrossPath, pathologists can pass judgment on each piece of evidence by clicking on the approve/decline buttons (Figure 6f). For other criteria (Figure 6g), pathologists can repeat the evidence examination workflow (Figure 6 (c-f)) until they have gained sufficient confidence for a grading diagnosis.

Heatmap visualization CrossPath shows a mitotic heatmap (Figure 7a) with mitotic count density, a Ki-67 heatmap (Figure 7b) with Ki-67 index value, a nuclei heatmap (Figure 7c) with nuclei density, a necrosis probability heatmap (Figure 7d) with necrosis probability, a prominent nucleoli heatmap (Figure 7e) with prominent nucleoli density, a sheeting heatmap (Figure 7f) that shows meningothelial vs. fibrous subtype, and a brain invasion heatmap (Figure 7g) with tumor vs. non-tumor regions. If the mid-level evidence samples are insufficient for them to verify AI’s results, pathologists can go beyond the sampled areas and navigate the high-heat areas using the heatmap. Importantly, the heatmap would be used as a ‘screening tool’ to help pathologists rapidly narrow down to a localized region for more evidence without having to scan the entire WSI.
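A minimal sketch of how one of these criterion heatmaps could be assembled: the criterion's model is run over non-overlapping tiles of the WSI, and the per-tile scores are arranged into a coarse grid that is later overlaid on the slide. Here `score_tile` is a placeholder for whichever model serves the selected criterion; the tile size and normalization are assumptions, not CrossPath's exact implementation.

```python
import numpy as np

def criterion_heatmap(wsi, score_tile, tile=512):
    """Build a coarse heatmap for one criterion.
    wsi: H x W x 3 image array; tile: tile edge length in pixels;
    score_tile: callable mapping a tile to a scalar score (e.g., a necrosis
    probability or a mitosis count) -- a stand-in for the criterion's AI model."""
    h, w = wsi.shape[:2]
    rows, cols = h // tile, w // tile
    heat = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            patch = wsi[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            heat[r, c] = score_tile(patch)
    # normalize to [0, 1] so different criteria can share one color scale
    if heat.max() > heat.min():
        heat = (heat - heat.min()) / (heat.max() - heat.min())
    return heat
```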

Modifying AI results Given the non-perfect behavior of the AI, pathologists can further shepherd AI by clicking on the approve/decline buttons (Figure 6f) after they have examined the evidence within the WSI context. For the five histological patterns as well as the brain invasion criterion, pathologists can directly override the AI-recommended results by right-clicking on each criterion (Figure 3c). Correspondingly, the overall suggested grading is updated if the manually overridden findings indicate a different result based on WHO guidelines.

6 Technical Evaluation

CrossPath uses eight models to detect various histological patterns that help pathologists reach a diagnosis. In this section, we report the performance of the AI models on the validation datasets. Specifically, we validated the mitotic, necrosis, prominent nucleoli, and sheeting criteria with corresponding validation sets (mitotic count: 39 positive, 80 negative; necrosis: 52 positive, 519 negative; prominent nucleoli: 19 positive, 57 negative; sheeting: 91 meningothelial, 70 fibrous, 101 other) and report the ROC curve (Figure 8) and AUC (area under the curve) score of each model. Because the hypercellularity and Ki-67 index criteria are mainly cell-counting tasks, we validated their performance with 20 randomly selected patches and report the average error rate.

Figure 8: Classification performance for (a) mitotic detection, (b) necrosis detection, (c) prominent nucleoli detection, (d) sheeting detection (other/meningothelial/fibrous detection). The solid lines in each sub-figure illustrate the ROC curves of each model. The red dashed lines represent random-guess performance. The legend in each sub-figure reports the identification AUC of each class of the corresponding model.

In summary, as shown in Figure 8, CrossPath achieved AUCs of 0.842, 0.989, 0.923, and 0.834/0.673 in identifying mitotic nuclei, necrosis, prominent nucleoli, and meningothelial/fibrous tissues, respectively. The average error rates of nuclei counting (hypercellularity) and Ki-67 index calculation are and , respectively.
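For a binary criterion such as necrosis detection, these ROC/AUC numbers can be reproduced from patch-level labels and model scores on the validation set; below is a minimal sketch using scikit-learn, with illustrative variable names.

```python
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate_criterion(y_true, y_score):
    """y_true: 0/1 labels for validation patches (e.g., necrosis vs. not);
    y_score: the model's predicted probability for the positive class.
    Returns the ROC curve points and the AUC."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    return fpr, tpr, auc

# Toy example with four validation patches
fpr, tpr, auc = evaluate_criterion([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC = {auc:.3f}")
```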

Due to a lack of data at present, for brain invasion and small cell patterns, rather than drawing a conclusion, CrossPath uses a rule-based, unsupervised approach to recommend areas for pathologists to examine and we discuss the performance on these two criteria later by referring to how pathologists in our study made use of the recommendation.

7 Work Sessions with Pathologists

We conducted work sessions with pathologists using CrossPath. The main research questions are:

  • RQ1: How do pathologists interact with CrossPath? How do they process AI-generated results across multiple criteria? How do they navigate the top-down structure to view AI’s results and the evidence that supports such results?

  • RQ2: Does CrossPath reduce pathologists’ effort compared to their existing workflow? How much time would a pathologist spend on one WSI in CrossPath compared to existing practice? How much cognitive effort do pathologists perceive when using CrossPath compared to manual examination?

  • RQ3: Does CrossPath add value if adopted to pathologists’ existing workflow? What is pathologists’ perceived value of CrossPath: what works better than manual examination and what does not?

7.1 Participants

We recruited seven pathologists from a local medical center. The participants had between one and 22 years of experience in pathology (mean=6.3), including three attendings and four senior residents. Three participants had a background in machine learning, and three had experience with digital whole slide imaging interfaces. Three of the participants examined meningioma slides weekly, three had done so within the past half year, and one had not practiced grading meningioma for over a year.

7.2 Data & Apparatus

We collected three meningioma cases (two grade II, one grade III) from a local medical center. The ground-truth grading decisions were confirmed by a board of pathologists. Due to pathologists’ limited availability, we only used a small number of slides in each case: the two grade II cases each had one H&E and one Ki-67 slide, while the grade III case only had one H&E slide. Due to the COVID-19 pandemic, all sessions were conducted online. Pathologists used Zoom’s remote control to interact with CrossPath running on an experimenter’s computer.

7.3 Task & Procedure

We first introduced the background of CrossPath and provided a detailed walkthrough of the system for each participant, using a sample case that had one H&E and one Ki-67 slide. After that, we ran a timed session in which the pathologists graded the three meningioma cases, which contained three H&E slides and two Ki-67 slides in total. For each case, the time was counted from when participants first clicked on the WSI case until they had reached a grading diagnosis. After participants had finished grading all three cases, we asked them to self-report their estimated time consumption for the three cases using the optical microscope. In this study, we did not compare CrossPath with a traditional digital interface because participants’ experience with such interfaces varied; a comparison also would not fit into the limited time each pathologist was available to spend in our work session. After the participants had examined all the cases, we conducted a semi-structured interview to elicit pathologists’ responses on CrossPath’s perceived effort and added value. The average duration of each work session was about 50 minutes.

7.4 Measures

We measured CrossPath’s usability through five-level Likert scales on various topics, listed as follows:

  • Comprehensiveness: Participants answered the following three questions: C1:"The multi-criterion H&E features found by the system are useful to assist the diagnosis", C2:"The system gives a global view of histological features", C3:"Overall, the system is comprehensive";

  • Explainability: Participants answered the following three questions: E1:"The evidence found by the system can explain the summative histological features", E2:"The suggested WHO grading generated by the system is explainable", E3:"Overall, the system is explainable";

  • Integrability: Participants answered the question: I1:"It is easy to integrate the AI evidence and result into the current clinical process"

  • Workload: Participants answered questions following the NASA Task Load Index [17] in three dimensions (W1: mental demand, W2: temporal demand, and W3: effort);

  • Integrity: Participants answered the question: T1:"Overall, based on the shown case study, I believe the system’s suggested diagnosis matches my justification"

  • Future use: Participants answered the question: F1:"If approved by the FDA, I will continue to use the system in my practice";

8 Findings

In this section, we summarize the results as well as the recurring themes found in the work sessions. We further discuss three design implications for future AI-aided diagnostic systems.

8.1 Diagnosis Consistency & Time Consumption

With the aid of CrossPath, all participants agreed with the two grade II cases as suggested by CrossPath. For the grade III case, two (P4, P6) agreed with the machine-suggested grade III, four (P1, P2, P3, P7) agreed that the case was a high-grade meningioma (grade II/III) but needed more verification from the glass slide, and one (P5) downgraded the suggested diagnosis to grade II by overriding a number of mitotic counts shown in the evidence panel.

The timed session shows that each pathologist spent 5.61 minutes (std=1.17) on average on each slide, which is lower than their self-reported estimate for manual examination with traditional glass slides (8.43 minutes). Such a time reduction should factor in that pathologists were still learning CrossPath, and the time spent on each slide included asking clarification questions. Further, it should be noted that the time reduction was achieved despite the latency issues caused by the remote study setup.

8.2 Qualitative User Feedback and Recurring Themes

Table 1 summarizes participants’ responses to the post-study questionnaire. Based on our observations from the work sessions with pathologists, we present our findings below, summarized into five themes.

How pathologists use CrossPath’s multiple criteria: prioritizing one, referring to others on demand
We observed that if a specific criterion did not meet a pathologist’s bar for diagnosing a higher grade, pathologists would use CrossPath to browse other criteria, looking for evidence for a differential diagnosis, until they identified sufficient evidence to support their hypothesis. P3 valued such an availability of multiple criteria: "it’s kind of a safety net to prevent me from under grading" (P3).

* 1: Strongly disagree – 5: Strongly agree

Question   1   2   3   4   5   Mean
C1         -   -   -   3   4   4.57
C2         -   -   -   4   3   4.43
C3         -   -   -   4   3   4.43
E1         -   1   1   3   2   3.86
E2         -   -   1   2   3   4.33
E3         -   1   1   3   2   3.86
I1         -   -   -   4   3   4.43
W1         -   -   -   3   4   4.57
W2         -   -   1   4   2   4.14
W3         -   -   -   5   2   4.28
T1         -   -   2   4   1   3.86
F1         -   -   1   4   2   4.14

Table 1: Participant response with Likert scores. The numbers indicate how many of the participants rated the score. Note that one participant did not answer question E2.

In the timed session, we observed that some pathologists started by focusing on specific criteria to ascertain a probable diagnosis as quickly as possible. "Looks like there is a brain invasion, and I agree with grade II" (P1). "…to be certain, I would like to go over the glass slide for the five histological features, but for me, it is enough for grade II" (P2). "I didn’t look at this whole slide by myself. But if I’m using just CrossPath and I believe the mitosis looks with a grade II" (P6).

However, some pathologists would also like to see other criteria and examine the slide comprehensively— "That already is grade II by on the mitotic count, but we would also probably go look for these other criteria" (P3). "What I would want to do is take a look at the rest of it to make sure I’m not missing anything" (P4). "I think I have already agreed based on the four mitosis in the HPF for grade II, but I still want to try and navigate the other functions" (P5).

Pathologists did not treat all criteria provided by CrossPath equally. Pathologists would prioritize examining a specific criterion, e.g., brain invasion (P1), and its outcome would then guide them to selectively examine other criteria for making a differential diagnosis. Such a relationship between criteria is analogous to ‘focus + context’ in information visualization [8]—different pathologists might focus on a few different criteria, but the other criteria are also important to serve as context at their disposal for supporting an existing diagnosis or finding an alternative one.

How pathologists use CrossPath’s multiple criteria: overlaying and cross-checking one another
Although the WHO guideline does not specify a quantified Ki-67 cut-off for meningioma grading, we observed that participants were able to incorporate the additional Ki-67 information with H&E using CrossPath (P2, P4, P6). More specifically, users overlapped the high Ki-67 index regions with the mitotic density heatmap to examine whether the AI result is accurate at a high-level. We also noticed a pathologist (P2) could use the Ki-67 information provided by the system to improve their diagnosis— "It did not really look like a grade I as I started with, …, I think the mitosis and the higher Ki makes me think it is not a grade I" (P2).

CrossPath’s top-down flow enables pathologists to navigate between high-level AI results & low-level WSI details
Amongst the seven histologic criteria provided by CrossPath, participants all seemed to agree that automating the mitotic count criterion would best save their time— "looking for mitosis is hard so that program really speed up [what] we’re actually looking at" (P6). One of the main reasons that limits the throughput of histological diagnosis is that criteria like mitotic count present very small-scale histologic features. As a result, pathologists have to switch to high power magnification to examine such small features in detail. Given the high resolution of the WSI, it is possible to ‘get lost’ in the narrow scope of an HPF, resulting in a time-consuming process to go through the entire WSI (low throughput). With CrossPath, pathologists found that its top-down flow uses specific evidence to bridge high-level AI results at a global overview and the WSI’s low-level details in a narrow scope. "You really have to go to high power to search for all the mitosis. So the fact that it (CrossPath) highlights the highest region and then you can just quickly look through that area or those evidence boxes. I think that’s really helpful" (P5).

CrossPath’s explainable design helps pathologists see what AI is doing (wrong)
One way of overcoming subjectivity is promoting transparency. Given the non-perfect behavior of the AI algorithms, providing evidence-based justification was shown to be helpful for pathologists to utilize AI’s results— "It (CrossPath) shows areas of interests, I think it is most helpful for us, cause if it shows there is necrosis, and it does not show us where, then we cannot trust it" (P1). In CrossPath, users encountered false-positive pieces of evidence in the mitotic and necrosis criteria. The sampled evidence provided by CrossPath helped users understand what patterns the AI was trying to pick up— "For necrosis, I think it’s looking for pink type stuff, but it’s making the mistake of picking up on some collagen, and then there are probably not necrosis… so even when it’s wrong it’s explainable" (P3). We also observed that explainability helped pathologists understand AI’s limitations and how false-positive cases would occur— "Even under the microscope, a lot of apoptotic bodies can look like mitotic figures, and I think the system was bringing up a combination of those two … Things like that so I can understand why the system was doing what it was doing" (P4). "I think the computer is trying to find the dense dark areas because that’s kind of essentially what a mitotic figure is. But of course, that’s going to catch other things— sometimes it’s not it could just be like an apoptotic body" (P7).

Shepherding AI with CrossPath: pathologists are able to but do not often confirm or override AI’s result
Given the explainable evidence provided by CrossPath, it was easy for pathologists to recognize false-positive cases and rapidly override them— "I’m going to downgrade this to a WHO grade II based on the histologic features vs. the mitotic. Because I think this thing (the AI) picked up more of the neutrophils rather than the mitosis" (P5). However, pathologists did not often confirm/override AI’s results on specific pieces of evidence by clicking on the approve/decline buttons or modifying AI results directly on the criteria panel. We hypothesize that this is in part related to how pathologists generate their clinical report: pathologists comment on histological features only if they have found them. When using CrossPath, pathologists could pull a correct piece of evidence directly into their report; otherwise, if they regarded a piece of evidence as a false positive, by training they did not need to do anything, since it would not appear in the report. At the system level, one way to incentivize pathologists’ input is to make the manually overridden evidence more useful: the AI can pick up the new annotations and retrain the model. Correspondingly, the system can update the evidence interactively and show pathologists better, refined evidence based on their annotations.

Another tool that CrossPath provides to help pathologists shepherd AI is the heatmap visualization of the criteria. In our study, users expressed that the heatmap visualization is helpful for navigating the WSI and selecting where to look when verifying AI’s result, even though it took some effort to understand— "I think the heat maps are helpful, but I think it took me a little bit to understand what exactly was trying to do" (P4). "It (the heatmap) light up wherever they (mitosis) are" (P5). "I feel like the sheeting, I thought that was a really cool idea that the machine could kind of show where the tissue looks the same versus different" (P6).

9 Discussion

We discuss three design implications based on the user feedback from CrossPath. These implications are not limited to meningioma grading and are expected to generalize to other medical applications, such as AI-aided histological image processing, or X-ray/CT/MRI image processing with AI. We further discuss limitations that lead to potential future work to improve the current design and implementation of CrossPath.

9.1 Summary & Implications for Design

Focus+context design Medical diagnosis involves accumulating evidence from multiple criteria—our study observed that pathologists started with a focus on one criterion while continuing to examine others for a differential diagnosis. Thus, medical AI systems should not only make multiple criteria available but also support navigating such criteria following a ‘focus+context’ design [8]. One opportunity is to design a ‘focus+context’ visualization of multiple criteria, where the major challenge is to strike a balance between juxtaposing the focused criterion with a sufficient amount of contextual criteria without overwhelming pathologists with too much information. It is also possible for a system, based on a patient’s prior history and a preprocessing of their data, to recommend that a pathologist start with a focus on certain criteria and then examine some others as context.

Providing evidence for AI’s result Given the current limitations of AI algorithms, AI-aided diagnostic systems should provide evidence for physicians to verify AI’s results. Like CrossPath, such a system can provide multi-level explanatory evidence. For example, CrossPath offers three levels of explanations: high-level logic based on WHO guidelines, mid-level explanation-by-example evidence sampling, and low-level registration to specific WSI features. Multi-level evidence can be especially helpful for diseases that involve multiple criteria and are diagnosed with rule-based guidelines. With multi-level evidence, users can navigate in a top-down manner and rapidly form an estimate of AI performance as well as its mistakes, which also supports gaining trust in AI-aided diagnostic systems.

Making interactions useful The ultimate goal of medical AI is to reduce the workload of medical professionals and help them reach a diagnosis faster. While fully automated diagnosis is still far off, current AI-aided diagnostic interfaces should be human-centered and integrable into specific domains, rather than AI-centered and abstracted from real-world scenarios. Hence, interactive features on an interface should offer medical users a clear, straightforward, and actionable implication without much extra effort to understand them. In CrossPath, we found that pathologists tended not to use the ‘approve/decline’ and ‘modify AI result’ functions, as these interactions yield little benefit compared to others (seeing more evidence of AI’s results). To overcome this limitation, making interactions clearly useful could be a solution: the AI can follow the physician-annotated results and update the diagnosis and evidence accordingly, which could better incentivize physicians.

9.2 Thresholding, False Positives and False Negatives

Currently, CrossPath does not support adjusting the prediction threshold directly from the front end. In our user study, a participant was interested in whether they could adjust the prediction threshold – "Some of the (mitosis) is hard to be certain, each pathologist has his own threshold" (P2). Furthermore, dealing with false positives and false negatives is another issue with the fixed-threshold scheme the system currently uses. From our study, we found that doctors prefer high-sensitivity results that include some false positives over low-sensitivity results that contain false negatives – "I guess it’s better to do false positives and false negatives because we can always do the double-check like being more sensitive like that" (P5). Therefore, in future work, the system should enable adaptive thresholding and allow pathologists to adjust the threshold on the front end. Further, the system should provide as many ROIs as possible without fear of showing doctors false-positive results, since doctors are fast at examining ROIs, whereas missing important features would come at a high cost.
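A minimal sketch of what such an adjustable threshold could look like on the front end: candidate evidence is filtered by a pathologist-controlled cut-off over the model's scores, so lowering the threshold trades more false positives for higher sensitivity. The function and data below are hypothetical illustrations, not part of CrossPath.

```python
def filter_evidence(candidates, threshold=0.5):
    """candidates: list of (patch_id, score) pairs from a criterion's model.
    Lowering `threshold` yields more (possibly false-positive) evidence;
    raising it yields fewer, higher-confidence patches."""
    return [(pid, s) for pid, s in candidates if s >= threshold]

candidates = [("p1", 0.92), ("p2", 0.61), ("p3", 0.34)]
print(filter_evidence(candidates, threshold=0.3))   # high sensitivity: all three patches
print(filter_evidence(candidates, threshold=0.7))   # high precision: only p1
```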

9.3 Human-AI Collaboration: Where Can They Really Meet?

Identifying what AI is good at vs. what humans are good at has been a long-standing challenge in the HCI/AI communities. In our user study, we found that pathologists did not treat all criteria equally, and they expressed that some automation (brain invasion, for example) is not as helpful as others. A naïve solution would be partitioning the tasks into two categories: human-preferable tasks and AI-preferable tasks. However, the clear boundary of such a partition easily blurs in a human-AI collaborative environment.

References

  • [2] Ellen Abry, Ingrid Ø Thomassen, Øyvind O Salvesen, and Sverre H Torp. 2010. The significance of Ki-67/MIB-1 labeling index in human meningiomas: a literature study. Pathology-Research and Practice 206, 12 (2010), 810–815.
  • [3] Vahid Anari, Parvin Mahzouni, and Rasoul Amirfattahi. 2010. Computer-aided detection of proliferative cells and mitosis index in immunohistichemically images of meningioma. In 2010 6th Iranian Conference on Machine Vision and Image Processing. IEEE, 1–5.
  • [4] Thomas Backer-Grøndahl, Bjørnar H Moen, and Sverre H Torp. 2012. The histopathological spectrum of human meningiomas. International journal of clinical and experimental pathology 5, 3 (2012), 231.
  • [5] Peter Bankhead, Maurice B Loughrey, José A Fernández, Yvonne Dombrowski, Darragh G McArt, Philip D Dunne, Stephen McQuaid, Ronan T Gray, Liam J Murray, Helen G Coleman, and others. 2017. QuPath: Open source software for digital pathology image analysis. Scientific reports 7, 1 (2017), 16878.
  • [6] Jocelyn Barker, Assaf Hoogi, Adrien Depeursinge, and Daniel L Rubin. 2016. Automated classification of brain tumor type in whole-slide digital pathology images using local representative tiles. Medical image analysis 30 (2016), 60–71.
  • [7] Carrie J Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S Corrado, Martin C Stumpe, and others. 2019. Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–14.
  • [8] Mackinlay Card. 1999. Readings in information visualization: using vision to think. Morgan Kaufmann.
  • [9] Anne E Carpenter, Thouis R Jones, Michael R Lamprecht, Colin Clarke, In Han Kang, Ola Friman, David A Guertin, Joo Han Chang, Robert A Lindquist, Jason Moffat, and others. 2006. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology 7, 10 (2006), R100.
  • [10] Xiang’Anthony’ Chen, Ye Tao, Guanyun Wang, Runchang Kang, Tovi Grossman, Stelian Coros, and Scott E Hudson. 2018. Forte: User-driven generative design. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
  • [11] Tony J Collins. 2007. ImageJ for microscopy. Biotechniques 43, S1 (2007), S25–S30.
  • [12] Pat Croskerry, Karen Cosby, Mark L Graber, and Hardeep Singh. 2017. Diagnosis: Interpreting the shadows. CRC Press.
  • [13] Soumya De, R Joe Stanley, Cheng Lu, Rodney Long, Sameer Antani, George Thoma, and Rosemary Zuna. 2013. A fusion-based approach for uterine cervical cancer histology image classification. Computerized medical imaging and graphics 37, 7-8 (2013), 475–487.
  • [14] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and others. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96. 226–231.
  • [15] Grzegorz T Gurda, Alexander S Baras, and Robert J Kurman. 2014. Ki-67 index as an ancillary tool in the differential diagnosis of proliferative endometrial lesions with secretory change. International Journal of Gynecological Pathology 33, 2 (2014), 114–119.
  • [16] David A Gutman, Mohammed Khalilia, Sanghoon Lee, Michael Nalisnik, Zach Mullen, Jonathan Beezley, Deepak R Chittajallu, David Manthey, and Lee AD Cooper. 2017. The digital slide archive: A software platform for management, integration, and analysis of histology for cancer research. Cancer research 77, 21 (2017), e75–e78.
  • [17] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • [19] Andreas Holzinger, Bernd Malle, Peter Kieseberg, Peter M Roth, Heimo Müller, Robert Reihs, and Kurt Zatloukal. 2017. Towards the augmented pathologist: Challenges of explainable-ai in digital pathology. arXiv preprint arXiv:1712.06657 (2017).
  • [20] Yongxiang Huang and Albert Chi-shing Chung. 2018. Improving high resolution histology image classification with deep spatial fusion network. In Computational Pathology and Ophthalmic Medical Image Analysis. Springer, 19–26.
  • [21] Humayun Irshad. 2013. Automated mitosis detection in histopathology using morphological and multi-channel statistics features. Journal of pathology informatics 4 (2013).
  • [22] Ellen C Jensen. 2013. Quantitative analysis of histological staining and fluorescence using ImageJ. The Anatomical Record 296, 3 (2013), 378–381.
  • [23] Daisuke Komura and Shumpei Ishikawa. 2018. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal 16 (2018), 34–42.
  • [24] Bokyung Lee, Taeil Jin, Sung-Hee Lee, and Daniel Saakes. 2019. SmartManikin: Virtual Humans with Agency for Design Tools. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
  • [25] Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q Nelson, Greg S Corrado, and others. 2017. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442 (2017).
  • [26] David N Louis, Hiroko Ohgaki, Otmar D Wiestler, Webster K Cavenee, Peter C Burger, Anne Jouvet, Bernd W Scheithauer, and Paul Kleihues. 2007. The 2007 WHO classification of tumours of the central nervous system. Acta neuropathologica 114, 2 (2007), 97–109.
  • [27] Cheng Lu and Mrinal Mandal. 2013. Toward automatic mitotic cell detection and segmentation in multispectral histopathological images. IEEE Journal of Biomedical and Health Informatics 18, 2 (2013), 594–605.
  • [28] Anne L Martel, Dan Hosseinzadeh, Caglar Senaras, Yu Zhou, Azadeh Yazdanpanah, Rushin Shojaii, Emily S Patterson, Anant Madabhushi, and Metin N Gurcan. 2017. An image analysis resource for cancer research: PIIP—pathology image informatics platform for visualization, analysis, and management. Cancer research 77, 21 (2017), e83–e86.
  • [29] Tao Meng, Lin Lin, Mei-Ling Shyu, and Shu-Ching Chen. 2010. Histology image classification using supervised classification and multimodal fusion. In 2010 IEEE international symposium on multimedia. IEEE, 145–152.
  • [30] Rashika Mishra, Ovidiu Daescu, Patrick Leavey, Dinesh Rakheja, and Anita Sengupta. 2018. Convolutional neural network for histopathological analysis of osteosarcoma. Journal of Computational Biology 25, 3 (2018), 313–325.
  • [31] Nobuyuki Otsu. 1979. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics 9, 1 (1979), 62–66.
  • [32] Lucio Palma, Paolo Celli, Carmine Franco, Luigi Cervoni, and Giampaolo Cantore. 1997. Long-term prognosis for atypical and malignant meningiomas: a study of 71 surgical cases. Journal of neurosurgery 86, 5 (1997), 793–800.
  • [33] Alexander Rakhlin, Alexey Shvets, Vladimir Iglovikov, and Alexandr A Kalinin. 2018. Deep convolutional neural networks for breast cancer histology image analysis. In International Conference Image Analysis and Recognition. Springer, 737–744.
  • [34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
  • [35] Curtis T Rueden, Johannes Schindelin, Mark C Hiner, Barry E DeZonia, Alison E Walter, Ellen T Arena, and Kevin W Eliceiri. 2017. ImageJ2: ImageJ for the next generation of scientific image data. BMC bioinformatics 18, 1 (2017), 529.
  • [36] Monjoy Saha, Chandan Chakraborty, Indu Arun, Rosina Ahmed, and Sanjoy Chatterjee. 2017. An advanced deep learning approach for Ki-67 stained hotspot detection and proliferation rate scoring for prognostic evaluation of breast cancer. Scientific reports 7, 1 (2017), 1–14.
  • [37] Joel Saltz, Ashish Sharma, Ganesh Iyer, Erich Bremer, Feiqiao Wang, Alina Jasniewski, Tammy DiPrima, Jonas S Almeida, Yi Gao, Tianhao Zhao, and others. 2017. A containerized software system for generation, management, and exploration of features from whole slide tissue images. Cancer research 77, 21 (2017), e79–e82.
  • [38] Johannes Schindelin, Ignacio Arganda-Carreras, Erwin Frise, Verena Kaynig, Mark Longair, Tobias Pietzsch, Stephan Preibisch, Curtis Rueden, Stephan Saalfeld, Benjamin Schmid, and others. 2012. Fiji: an open-source platform for biological-image analysis. Nature methods 9, 7 (2012), 676.
  • [39] Caroline A Schneider, Wayne S Rasband, and Kevin W Eliceiri. 2012. NIH Image to ImageJ: 25 years of image analysis. Nature methods 9, 7 (2012), 671.
  • [40] Burr Settles. 2009. Active learning literature survey. Technical Report. University of Wisconsin-Madison Department of Computer Sciences.
  • [41] Harshita Sharma, Norman Zerbe, Iris Klempert, Olaf Hellwich, and Peter Hufnagl. 2017. Deep convolutional neural networks for automatic classification of gastric carcinoma using whole slide images in digital histopathology. Computerized Medical Imaging and Graphics 61 (2017), 2–13.
  • [42] Ie-Ming Shih and Robert J Kurman. 1998. Ki-67 labeling index in the differential diagnosis of exaggerated placental site, placental site trophoblastic tumor, and choriocarcinoma: a double immunohistochemical staining technique using Ki-67 and Mel-CAM antibodies. Human pathology 29, 1 (1998), 27–33.
  • [43] Christoph Sommer, Christoph Straehle, Ullrich Koethe, and Fred A Hamprecht. 2011. Ilastik: Interactive learning and segmentation toolkit. In 2011 IEEE international symposium on biomedical imaging: From nano to macro. IEEE, 230–233.
  • [44] Hamid Reza Tizhoosh and Liron Pantanowitz. 2018. Artificial intelligence and digital pathology: challenges and opportunities. Journal of pathology informatics 9 (2018).
  • [45] Mitko Veta, Paul J Van Diest, Robert Kornegoor, André Huisman, Max A Viergever, and Josien PW Pluim. 2013. Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PloS one 8, 7 (2013).
  • [46] Brian Patrick Walcott, Brian V Nahed, Priscilla K Brastianos, and Jay S Loeffler. 2013. Radiation treatment for WHO grade II and III meningiomas. Frontiers in oncology 3 (2013), 227.
  • [47] Nora S Willett, Rubaiat Habib Kazi, Michael Chen, George Fitzmaurice, Adam Finkelstein, and Tovi Grossman. 2018. A mixed-initiative interface for animating static pictures. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 649–661.
  • [48] Fuyong Xing, Toby C Cornish, Tell Bennett, Debashis Ghosh, and Lin Yang. 2019. Pixel-to-pixel learning with weak supervision for single-stage nucleus recognition in Ki67 images. IEEE Transactions on Biomedical Engineering 66, 11 (2019), 3088–3097.
  • [49] Choon K Yap, Emarene M Kalaw, Malay Singh, Kian T Chong, Danilo M Giron, Chao-Hui Huang, Li Cheng, Yan N Law, and Hwee Kuan Lee. 2015. Automated image based prominent nucleoli detection. Journal of pathology informatics 6 (2015).
  • [50] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2018. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 3–11.

Appendix A: Dataset Descriptions and Model Details

Mitotic Count We first collected 1183 nuclei (478 positive, 705 negative) and cropped them into square patches (the dimension of each pixel is 0.2508 µm). We then split the dataset with a 0.9/0.1 train/validation ratio. Next, we used a Keras implementation of the ResNet50 model [18] and trained it with the Adam optimizer, binary cross-entropy loss, 100 epochs, and a batch size of 32. To avoid false positives affecting the grading result, only hard-positive nuclei with prediction probability > 0.7 are counted as mitoses.
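For illustration, the following Keras sketch approximates the classifier setup described above. The 64×64-pixel input size is an assumption (the exact patch dimensions are not restated here), and data loading and augmentation are omitted.

```python
# Minimal sketch of the mitosis classifier setup, assuming patch arrays
# x_train / y_train / x_val / y_val have already been prepared.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

def build_mitosis_classifier(input_shape=(64, 64, 3)):  # patch size assumed
    # ResNet50 backbone followed by a single sigmoid unit for
    # mitotic vs. non-mitotic nucleus classification.
    backbone = ResNet50(include_top=False, weights=None,
                        input_shape=input_shape, pooling='avg')
    output = layers.Dense(1, activation='sigmoid')(backbone.output)
    model = models.Model(backbone.input, output)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_mitosis_classifier()
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=32)

# Only hard positives (probability > 0.7) are counted as mitoses.
probs = model.predict(np.zeros((4, 64, 64, 3), dtype=np.float32)).ravel()
mitotic_count = int((probs > 0.7).sum())
```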

Ki-67 Index We used the positive cell count function from [16] to detect both Ki-67-positive and Ki-67-negative nuclei. The Ki-67 index of each patch is calculated as the fraction of Ki-67-positive nuclei among all detected nuclei in that patch.
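As a small worked example, assuming the positive-cell-count function returns the numbers of positive and negative nuclei for a patch, the Ki-67 index can be computed as follows.

```python
# Minimal sketch: the Ki-67 labeling index for a patch, given the counts of
# Ki-67-positive and Ki-67-negative nuclei detected in that patch.
def ki67_index(positive_count: int, negative_count: int) -> float:
    """Fraction of Ki-67-positive nuclei among all detected nuclei."""
    total = positive_count + negative_count
    return positive_count / total if total > 0 else 0.0

# Example: 42 positive and 958 negative nuclei -> index of 4.2%
print(f"{ki67_index(42, 958):.1%}")
```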

Hypercellularity We first obtained 11 regions with ground-truth cell segmentation. We then used a random-crop technique to construct a training set of 1780 training samples. Next, we trained a U-Net nuclei segmentation network [34] with the Adam optimizer, a mean-Intersection-Over-Union metric, binary cross-entropy loss, 20 epochs, and a batch size of 8. We further used a dynamic watershed method to post-process the U-Net segmentation result and obtained the final nuclei count.
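The sketch below illustrates one way a watershed post-processing step could turn a U-Net probability map into a nuclei count. The helper name and the threshold/peak-distance values are illustrative assumptions, not CrossPath's exact parameters.

```python
# Minimal sketch of nuclei counting, assuming `probability_map` is the U-Net's
# per-pixel nucleus probability for one patch.
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def count_nuclei(probability_map: np.ndarray, threshold: float = 0.5) -> int:
    # Binarize the U-Net output, then split touching nuclei with a
    # distance-transform watershed before counting the resulting labels.
    mask = probability_map > threshold
    distance = ndimage.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=5, labels=mask)
    markers = np.zeros_like(mask, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = watershed(-distance, markers, mask=mask)
    return int(labels.max())
```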

Necrosis We collected 26 necrosis regions and 37 non-necrosis regions in total. For each region, we used a random-crop method to obtain fixed-size patches. Next, we constructed a training set with 2004 patches (88 positive, 1916 negative, from 51 regions) and a validation set with 571 patches (52 positive, 519 negative, from 12 regions). We used a ResNet50 model and trained it with the Adam optimizer, binary cross-entropy loss, 30 epochs, and a batch size of 4.

Small Cell Due to a lack of annotated data, we selected the top 10 patches with the highest nuclei count within each slide as small cell recommendations. Since small cell regions also share a hypercellular pattern, only patches with >125 nuclei per patch are counted.
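A minimal sketch of this top-10 selection rule, assuming a mapping from patch coordinates to nuclei counts is already available (the function name and types are illustrative):

```python
# Minimal sketch: recommend the ten most cell-dense patches on a slide as
# small-cell candidates, keeping only patches above the hypercellularity cutoff.
from typing import Dict, List, Tuple

def small_cell_recommendations(
        patch_nuclei_counts: Dict[Tuple[int, int], int],
        min_nuclei: int = 125,
        top_k: int = 10) -> List[Tuple[int, int]]:
    # Keep patches above the cutoff, then take the k densest.
    dense = {coord: n for coord, n in patch_nuclei_counts.items() if n > min_nuclei}
    ranked = sorted(dense, key=dense.get, reverse=True)
    return ranked[:top_k]
```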

Prominent Nucleoli We collected 755 nuclei (206 positive and 549 negative) and cropped them into square patches. We then split the dataset with a 0.9/0.1 train/validation ratio and trained a ResNet50 model with the same hyperparameters as used in mitotic count identification. To avoid false positives influencing the result, only nuclei with prediction probability > 0.9 are counted as positive.

Sheeting We collected 20 meningothelial regions, 12 fibrous regions, and 12 other regions from 8 slides. For each region, we used a random-crop method to obtain 512×512-pixel patches. We constructed a training set with 2219 samples (719 meningothelial, 790 fibrous, 710 other) from 34 regions and a validation set with 262 samples (91 meningothelial, 70 fibrous, 101 other) from 10 regions. We then trained a ResNet50 model with the same hyperparameters as the necrosis detector.

Brain Invasion Since meningioma is a hypercellular tumor, we used Otsu’s thresholding method [31] to differentiate tumor from non-tumor brain tissue via cell counting: only patches containing more than 55 nuclei per 512×512-pixel patch are counted as tumor patches.
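The following sketch shows one plausible reading of this heuristic, applying Otsu's threshold to a grayscale patch and counting connected components as a nucleus proxy. The helper name and the component-counting choice are assumptions rather than CrossPath's exact pipeline.

```python
# Minimal sketch of the brain-invasion heuristic: Otsu thresholding to find
# nuclei, then flagging 512x512-pixel patches with more than 55 nuclei as tumor.
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def is_tumor_patch(gray_patch: np.ndarray, min_nuclei: int = 55) -> bool:
    # Nuclei stain darker than surrounding tissue, so keep pixels below the
    # Otsu threshold and count connected components as a nucleus proxy.
    mask = gray_patch < threshold_otsu(gray_patch)
    _, num_nuclei = ndimage.label(mask)
    return num_nuclei > min_nuclei
```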

The specific thresholds used for each criterion were obtained through discussion with a pathologist. We trained the models with the datasets on a CentOS 7 server with an Intel Xeon W-2133 CPU, 104 GB of memory, and two Nvidia RTX 2070 graphics cards.

Appendix B: ROI Sampling Rules

Below are the ROI sampling rules used in CrossPath:

  • a hypercellularity ROI is defined as a cluster with more than 2000 nuclei per high-power field (HPF);

  • a necrosis ROI is defined as a cluster with probability > 0.7 from the necrosis detector;

  • small cell ROIs are defined as the top 10 patches with the highest nuclei count over the slide;

  • a prominent nucleoli ROI is defined as a cluster with more than 60 prominent nucleoli per HPF.

The areas of hypercellularity, necrosis, and prominent nucleoli ROIs are calculated with the DBSCAN clustering algorithm [14]. If no matching ROI is found on a slide, CrossPath displays ‘no evidence found’ in the corresponding evidence list.
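For illustration, the sketch below groups detection centroids into candidate ROIs with scikit-learn's DBSCAN. The eps and min_samples values are placeholders, not the parameters used in CrossPath.

```python
# Minimal sketch of ROI formation, assuming `coords` is an (N, 2) array of
# detection centroids on the slide, in pixels.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_rois(coords: np.ndarray, eps: float = 200.0, min_samples: int = 20):
    """Group nearby detections into candidate ROIs; label -1 marks noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    rois = []
    for label in set(labels) - {-1}:
        members = coords[labels == label]
        # The bounding box of each cluster becomes one ROI.
        mins, maxs = members.min(axis=0), members.max(axis=0)
        rois.append({'bbox': (*mins, *maxs), 'count': len(members)})
    return rois
```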