With the rapid proliferation of Electronic Health Records (EHRs), various prediction models based on machine learning (ML) techniques have been proposed for improving the quality of clinical care [rajkomar2019machine]. An EHR stores an individual’s health profile, from structured attributes like demographic information and medications, to unstructured ones, such as clinical notes and medical images. Prediction models, trained on patients’ EHR data, can be useful for a wide range of medical tasks [harutyunyan2019multitask, rajkomar2018scalable], including predicting a patient’s remaining length of stay, the likelihood of hospital readmission, and in-hospital mortality.
Despite efforts by researchers and developers to improve the performance of these prediction models, challenges remain – including many associated with transparency and interpretability, which are particularly relevant in a highly regulated and risk-averse domain like healthcare [payrovnaziri2020explainable, sendak2020human]. At the same time, XAI (eXplainable Artificial Intelligence) techniques and software tools continue to be developed, many of which have already proven powerful at elucidating the workings of “black-box” ML models. Nevertheless, prediction models built upon modern ML techniques have not yet been widely and reliably used in clinical decision support workflows [jensen2012mining, miotto2016deep, rajkomar2019machine]. By surveying related literature [ahmad2018interpretable, lipton2017doctor, payrovnaziri2020explainable, tonekaboni2019clinicians, wang2019designing, yang2019unremarkable] and working with six clinicians from a children’s hospital, we found that the barriers preventing the application of XAI techniques in clinical settings are twofold.
First of all, clinicians engaging with XAI tools are often presumed to have sufficient technical expertise to understand and even improve ML models [ahmad2018interpretable]. In reality, clinicians – who may have little to no technical background – are more likely to assess ML predictions through the lens of their domain expertise [sendak2020human] than to understand and improve the ML model from a technical point of view. This disconnect between the technology and its users is exacerbated by the fact that clinicians are rarely involved in discussions of explainability during the development of XAI tools [lipton2017doctor]. As a result, the solutions provided by these tools are often intrinsically technical, making it difficult for clinicians to understand the “explanations” themselves [miller2019explanation, tonekaboni2019clinicians].
In addition, clinicians’ workflows are often guided by individual patients and may require tailored explanations based on each patient’s EHRs (i.e., local explanations). Among the copious XAI approaches that support local explanations, feature contribution is one of the most popular. Approaches from this category illustrate the degree of contribution particular ML features make to a prediction outcome [payrovnaziri2020explainable], which allows clinicians to directly compare model decisions with their own clinical judgment, especially when there is a disagreement. However, although these approaches have been extensively studied within the XAI field, there are still several significant challenges attendant to their actual use in healthcare. Clinicians working with ML may run into problems in the following areas:
Understanding ML features. Not every feature input to ML models is interpretable as-is by clinicians. For example, a patient’s vital sign (e.g., in-surgery heart rate) will be transformed into multiple ML features, each represented by an aggregate value (e.g., SD (standard deviation) or Trend (linear slope)) within a period (a.k.a. feature engineering) [nemati2018interpretable]. While easily understood by an ML model, this form of representation is almost certainly unfamiliar and non-intuitive to clinicians – they may struggle to judge what, for example, a “high” Trend indicates, and the potential consequences.
Connecting to patients’ original records. Clinicians are more familiar with a given patient’s original records than they are with ML features. In practice, they usually make decisions by referring to the raw data, such as laboratory test reports and vital signs from an anesthetic machine. However, feature contribution techniques only provide explanations on ML features, and do not deal with records directly [eck2017interpretation, ge2018interpretable, nemati2018interpretable, shrikumar2017learning]. How to seamlessly connect these explanations to patients’ original records remains an open question, and one that is underexplored.
Aligning with evidence. Simply presenting a list of feature contributions in the form of numerical values does not allow clinicians to assess the trustworthiness of the model’s predictions. Clinicians need to understand how feature contributions align with evidence-based medical practice [haynes1997evidence, tonekaboni2019clinicians]. In this research, we propose using cohort-level statistics, available through hospital records, to provide this evidence. Clinicians can compare a target patient’s feature values with reference values extracted from a cohort of similar patients.
The aforementioned challenges motivate us to design and develop a visual analytics solution that can seamlessly integrate feature-level explanations into a clinician’s decision-making workflow. We followed a user-centric design process [wang2019designing, yang2019unremarkable] from the outset, working with six pediatric clinicians with an average of 17 years of work experience. We derived seven design requirements from a pilot study with these clinicians; then, by observing their interactions with our early-stage system, we summarized two workflows – forward analysis and backward analysis – preferred by clinicians with different levels of expertise. These requirements and workflows guided the overall design and development of , a Visualization system that Bridges the gap between clinicians and ML models with tailored feature explanation algorithms and novel interaction and visualization techniques.
We adopted SHAP values [lundberg2018explainable] to generate contribution-based explanations of ML features and organized a large number of features in a hierarchy to facilitate interpretation. We developed a novel visualization – an interactive hierarchical feature list – to present such explanations to clinicians in a user-friendly manner and integrated tailored visual designs to allow clinicians to conduct reference-value-based analysis and what-if analysis at the feature level. To enable the connection between the feature explanations and the patient’s raw records, we applied Deep Feature Synthesis [kanter2015deep] on EHR data to build traceable transformation paths between features and raw records. Based on this, we present a tailored algorithm to identify the most influential records for a given feature. The patient’s original records are visualized in multiple coordinated views with different levels of detail. Various novel interactions, including linking and marking, help to visually associate the feature-level explanations and context information. The system was evaluated through two case studies and an expert interview with four clinicians, and results showed that our system is capable of supporting clinical decision-making.
To sum up, our contributions include:
A summary of seven design requirements facilitating the interpretation of ML predictions to clinicians; and the identification of two workflows describing how they work with ML models with feature-level explanations and needed context information.
A visual analytics system that integrates novel explanation algorithms and visualization and interaction techniques to connect the dots between ML features, explanations, and health records, improving clinicians’ decision-making workflow.
Two case studies and an expert interview demonstrating the usefulness and efficiency of our system.
2 Related Work
2.1 Explainable Machine Learning in Clinical Predictions
We categorize existing XAI techniques in clinical research based on whether the provided interpretability is intrinsic or post-hoc.
Intrinsic interpretability. Models provide intrinsic interpretability by directly incorporating interpretability into their structures [choi2016retain, fleisher2007clinical, jalali2016interpretable, kho2012use, kwon2018retainvis, payrovnaziri2020explainable]. Models in this category often use a simple structure to provide accurate and faithful explanations. For example, Kho et al. [kho2012use] used decision trees, which surface the set of rules driving the predictions, for predicting the genetic risk of type 2 diabetes. Despite their intrinsic interpretability, the performance of these models is bounded compared to advanced ML models (e.g., deep neural networks), especially when handling complex clinical prediction tasks [harutyunyan2019multitask, xiao2018opportunities]. Boosting and optimization techniques such as ensemble learning [jalali2016interpretable] can be used to enhance performance, but often at the cost of introducing additional complexity that impairs interpretability.
Recently, attention-based neural networks have begun to draw more focus [payrovnaziri2020explainable]. Such models do not directly inform clinicians of the reasons behind a prediction, instead highlighting the portions of historical data (e.g., clinical events) that have factored into it [choi2016retain, kwon2018retainvis]. Although deep learning models can produce accurate predictions, attention-based explanations may cause information overload and confuse clinicians due to the lack of clarity around how prediction results relate to the areas of attention [payrovnaziri2020explainable]. It is also challenging for attention-based deep learning models to support multimodal learning while preserving good interpretability [xiao2018opportunities].
Post-hoc interpretability. Post-hoc methods take “black-box” ML models as inputs and then derive explanations for model predictions [che2016interpretable, cheng2020dece, NIPS2017_7062, ming2018rulematrix, ribeiro2016should]. Unlike intrinsically interpretable models, post-hoc methods can be directly applied to existing models and thus are more flexible. One common approach is to use an intrinsically interpretable ML model to mimic a complex “black-box” ML model. For example, Che et al. [che2016interpretable] worked on acute lung injury (ALI) prediction and proposed a knowledge distillation method called mimic-learning, which uses gradient boosting trees to mimic the original deep learning model and provides rule explanations to clinicians.
Another type of work focuses on calculating feature contribution, which along with attention mechanism-based models is considered to be one of the most popular approaches for supporting local explanations [payrovnaziri2020explainable]. For example, Shapley Additive Explanations (SHAP) [NIPS2017_7062], which build on the Shapley value from cooperative game theory [shapley1953value], have been applied to explain in-surgery hypoxaemia predictions and support early prevention [lundberg2018explainable].
Our work provides a post-hoc method for explaining existing models, in which we adopt feature-contribution-based XAI approaches. In particular, we use SHAP to compute how each ML feature contributes to a particular prediction. We present a tailored visualization technique to display feature contributions to clinicians in a scalable and user-friendly manner.
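To make the notion of feature contribution concrete, the following is a minimal sketch of exact Shapley values computed by enumerating feature coalitions. It is not the SHAP library and not the estimator used in our system: the toy risk model, the feature names, and the choice of replacing absent features with cohort-mean baselines are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating feature coalitions.

    Features outside a coalition take their baseline value (an assumption of
    this sketch; practical SHAP estimators are more refined).
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical toy risk model and feature names, for illustration only.
def toy_risk(f):
    return 0.1 + 0.02 * f["lactate_mean"] + 0.001 * f["pulse_sd"] * f["age_months"]


patient = {"lactate_mean": 4.0, "pulse_sd": 12.0, "age_months": 6.0}
cohort_mean = {"lactate_mean": 2.0, "pulse_sd": 8.0, "age_months": 10.0}
phi = shapley_values(toy_risk, patient, cohort_mean)
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that makes per-feature contributions comparable with one another.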
2.2 Electronic Health Records Visualization
We classify existing visualization techniques on EHR data [west2015innovative] based on the criterion proposed by Rind et al. [Rind2013Survey] – visualization for exploring health records from one patient or multiple patients.
Individual patient records. The goal of visualizing individual patient records is to provide individual patient summaries, as well as an efficient way to explore personal complex record data at different levels of detail. A patient’s clinical records contain longitudinal data representing patient visits over time. One common way to summarize this history is through timeline-based visualizations, where events are placed on a horizontal timeline chronologically, using points or interval plots [plaisant2003lifelines, shahar1999intelligent, Shahar2006KNAVEII]. For events containing multiple attributes, glyphs [cao2011dicon] and additional tables [bauer2006evaluating, Ghassemi2018ClinicalVis] are used to visually summarize events and facilitate more detailed explorations. To further improve scalability, researchers have explored aggregation-based methods [krstajic2011cloudlines] and substitution-based approaches [Gotz2014DecisionFlow] to show frequent patterns instead of event details. Another line of research focuses on event pattern searching, filtering, and grouping [wongsuphasawat2009finding, wongsuphasawat2012querying], which support fast and efficient data exploration.
In addition to discrete events, clinical signals collected during ICU or surgery are also commonly included in the EHR data. These are usually sampled at a higher frequency and can be viewed as continuous time series data. Xu et al. [Xu2018ECGLens] used a spiral timeline to reveal periodic patterns of electrocardiogram data for arrhythmia detection. Our system builds on advances offered by prior visualization techniques to visualize a patient’s EHR data at different levels of detail. We further tailored them for better interpreting ML predictions with feature-contribution-based XAI approaches.
Multiple patient records. A number of scenarios require the analysis of multiple patient records, from patient cohort monitoring to observational clinical research. A large number of works focus on visualizing longitudinal EHR data [bernard2018using, caballero2017visual, Gotz2014DecisionFlow, gotz2014methodology, Jin2020CarePre, krstajic2011cloudlines, kwon2018retainvis, malik2015cohort, phan2007progressive, wongsuphasawat2011outflow], where glyphs [bernard2018using, caballero2017visual] and flow-based representations [Gotz2014DecisionFlow, wongsuphasawat2011outflow] are often used for summarization. Other works focus on visualizing multivariate attributes or features transformed from the original records [alemzadeh2017subpopulation, bernard2015visual, Krause2014infuse, kwon2017clustervision, muller2020visual]. For example, Krause et al. [Krause2014infuse] designed a glyph to visualize the quality of a feature under different metrics. In this work, we used aggregation-based methods to extract reference values from a cohort of patients. We proposed small intense, simple, and embeddable visualizations to show the reference values. These are integrated with the feature explanation view and raw record data visualization view to allow reference-value-based analysis.
3 Informing the Design
In this section, we introduce the pilot study and detail the design requirements and analysis workflow distilled from the study.
3.1 Pilot Study
The pilot study allowed us to understand how clinicians expect to use ML prediction models with feature contribution explanations to support their clinical decision-making. We followed the design study methodology from Sedlmair et al.’s work [sedlmair2012design] and designed the pilot study as follows.
Participants: The study involved 6 clinicians (3 male and 3 female) from the Children’s Hospital of Zhejiang University School of Medicine (ZJUCH): two chief physicians from the Cardiac Intensive Care Unit (CICU) (P1-2) and four residents from the Cardiology Department (P3-6). Among them, P1-3 are more senior, with an average of 24.5 years of work experience (20, 29, and 24 years respectively), while the others (P4-6) have an average of 10.5 years of experience (13, 10, and 8.5 years respectively).
Presetting: The pilot study is based on a scenario of postoperative complication predictions. Patients may develop various complications after surgeries, some of which can be life-threatening. Predictions in an early phase can help clinicians identify high-risk patients and carefully choose postoperative care plans. To support this scenario, we built a demo model for this prediction task. We worked with a biomedical data scientist (DS) from ZJUCH, a co-author of this paper, to carefully select a small set of features to train the model. We used SHAP values to show features’ contributions to the prediction result.
Process: The study was divided into two sessions. We began the first session by performing one-on-one, semi-structured, hour-long interviews with all the participants. During the interview, the participants were presented with a low-fidelity mockup of our early system and taught some basic ML concepts. They were asked several questions about their understanding and concerns. Based on the feedback collected from this session, we formulated the initial design requirements. Over the next three months, we developed a high-fidelity prototype system, holding weekly meetings with DS to make sure our implementation continued to meet requirements.
In the second session, we presented the prototype system to three participants (P2, P3, P5) separately. They were asked to explore the system freely and then complete several prediction tasks, during which they were encouraged to think aloud to explain their thoughts. We observed, took notes, and recorded their interactions. We then held an open discussion with them to further understand their behavior. The feedback collected from this round was used to polish our design requirements and refine our system.
3.2 Design Requirements
We summarized seven design requirements and grouped them into: feature-level explorations (Feature), record-level explorations (Data), and explorations of feature-record connections (Bridge).
R1 (Feature). Show features in a hierarchical structure. All participants (P1-6) confirmed that it is challenging to explore hundreds of features extracted from diverse and heterogeneous sources. They all agreed with the idea of grouping relevant features semantically for a better exploration experience. For example, the aggregation values (e.g., Mean, SD, and Trend) computed from the same series of data (e.g., pulse) can be reasonably grouped.
R2 (Feature). Provide features’ reference values. All participants (P1-6) agreed that the features, especially aggregate values that are unfamiliar to clinicians (e.g., SD and Trend), should be presented alongside reference values, which describe the range of values that is considered normal. Because there are no existing reference values for most of the features, the system should calculate them using data from a relevant cohort.
R3 (Feature). Provide flexible interactions to support on-demand explorations. Participants followed different strategies when exploring features. Some participants (P1, P3-5) were only interested in the riskiest factors (i.e., features with high positive contributions), while others (P2, P6) were also interested in negatively contributing features that could be helpful for lowering the surgery risks in the future. Thus, the system should enable sorting and filtering to support different exploration paths. In addition, P1 and P2 expressed a further need to conduct what-if analyses on abnormal features to better understand their effects on predictions.
R4 (Data). Provide an overview of the patient’s records. Patients have complex medical records, especially ICU patients, who may have substantially more data available than general patients. Good visualizations summarizing a patient’s visiting history “can save us a great amount of time in [familiarizing ourselves with] a patient’s background”, P2 and P6 confirmed.
R5 (Data). Show record details with reference values. Similar to R2, participants (P1, P3-5) suggested that showing patients’ historical health record details along with reference values helps them to make more informed decisions. P1 wanted to know whether these values are within the 95% confidence interval (CI) of a statistical summary of similar patients.
R6 (Bridge). Visually associate feature(s) with the patient’s records. All participants (P1-6) expressed the need to check the original records (i.e., medical events) of particular features that interested them. “Manually checking without any clues would take me 10-20 minutes”, P5 commented. Thus, visual associations of such correlations along with a tailored interaction mechanism should be enabled to support efficient back-and-forth analysis between features and their relevant original records.
R7 (Bridge). Highlight temporal value patterns that are influential to feature(s). Three participants (P1, P3, P5) expected the system to highlight high-risk time periods from long-lasting vital sign records related to the feature under investigation. P1 showed particular interest in influential time periods containing a series of data points, rather than isolated anomalous data points which could be caused by errors.
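The reference values requested for features and records can be illustrated with a short sketch. It assumes one possible reading of the 95% interval mentioned by P1, namely the cohort mean plus or minus 1.96 standard deviations; the function names and the cohort values are hypothetical and are not taken from our system.

```python
from statistics import mean, stdev


def reference_range(cohort_values, z=1.96):
    """Return (low, high) as mean +/- z * SD of the cohort's feature values.

    With z = 1.96 this covers about 95% of values if the cohort is roughly
    normally distributed (an assumption of this sketch).
    """
    m = mean(cohort_values)
    s = stdev(cohort_values)
    return m - z * s, m + z * s


def is_out_of_range(value, cohort_values, z=1.96):
    """Flag a patient's feature value that falls outside the reference range."""
    low, high = reference_range(cohort_values, z)
    return value < low or value > high


# Made-up SD-of-pulse values for a cohort of similar patients.
cohort_pulse_sd = [7.5, 8.1, 9.0, 8.4, 7.9, 8.8, 9.3, 8.0]
low, high = reference_range(cohort_pulse_sd)
```

Because the range is recomputed from whichever cohort is selected, changing the cohort filter changes what counts as abnormal for the same patient value.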
3.3 Analysis Workflow
After analyzing the user interaction patterns and discussion notes from the second pilot study session, we summarized two general analysis workflows: forward analysis and backward analysis.
Forward analysis: clinicians inspect the data in an order similar to that of the data processing flow (i.e., original records → features → predictions). They first make their own prediction based on the patients’ original records, then compare their predictions with the model predictions, and finally check explanations to make decisions. Senior clinicians such as P1 and P2 preferred to start by viewing the patient’s profile and forming initial hypotheses. They would then look directly at the original records (R4, R5) and check potentially influential observational records (e.g., in-surgery lactate records). After making their own predictions based on this evidence, they sought confirmation from ML models and feature explanations (R1-3). If the model prediction and explanations agreed with their expectations – assigning high contribution values to the factors they thought were risky – their trust in the model was enhanced. If not, they would refer to the patient’s original records relevant to the features they were investigating to find more evidence based on reference values (R5-7). They would then either reject the model prediction or gain new knowledge based on this evidence.
Backward analysis: clinicians inspect the data in the direction opposite to the data processing flow. They check the model predictions and explanations first, and then trace back to the original records, finding evidence to support their decisions. When clinicians started without a clear diagnostic prediction, they began with a feature list with contribution explanations (R1-3). They then identified a set of features for further investigation. For static and familiar dynamic features (e.g., in-surgery pulse), they compared the feature contributions with their expectations. For unfamiliar features, like SD or Trend of in-surgery systolic blood pressure, they preferred to check the details in the original records (R5-7). Sometimes these records were not sufficient to verify their hypothesis. They would then check the summary information of the original records (R4) in order to obtain different records from the same time, for correlation analysis.
4 Predictive Modeling
In this section, we first introduce the dataset, along with the prediction task we use as a running example for our research. Then we introduce how we extract ML features and generate explanations.
takes structured EHR data collections as input. These are often organized as relational databases. In this work, we use the Paediatric Intensive Care (PIC) Database [zeng2020pic] as an example, which contains de-identified clinical data of paediatric patients admitted to ZJUCH. In particular, the dataset collects over hospital admissions from unique paediatric patients, aged 0–18 years, admitted to the critical care unit between 2010 and 2019. The PIC follows the same paradigm to store ICU patient clinical records as the widely-studied Medical Information Mart for Intensive Care (MIMIC-III) dataset [johnson2016mimic], but puts more emphasis on paediatric patients. The dataset encompasses a number of information types, including demographics, surgery information, high-resolution vital sign measurements during surgery, laboratory test results, symptoms, medications, diagnostic codes, and mortality.
4.2 Running Example: Surgical Complication Prediction
To concretize our system’s contributions, we utilize a running example – predicting complications after cardiac surgery – from a case study involving two clinicians from the ZJUCH team (P1, P5). This team was interested in using ML models to predict whether a patient is at risk of developing five types of complications after cardiac surgery: lung, cardiac, arrhythmia, infectious, and others, which are each annotated with their first letter (i.e., L, C, A, I, O). A patient may experience multiple postoperative complications.
Working with the team and starting with the entire PIC dataset, we first selected patients who underwent cardiopulmonary bypass-supported cardiac surgery. 456 (25.0%) of these patients developed postoperative complications. From the medical records of these patients, we mainly extracted three types of static features (demographics, surgery information, and diagnosis results) and three types of dynamic features whose values change over time (lab tests, surgery vital signs, and chart events). Chart events contain patients’ routine vital signs (not during surgery) and additional information like inputs and outputs. In total, there were 1,724,805 lab test events, 450,989 chart events, and 754,213 data points from vital signs. In this example, our goal was to build 5 individual binary classifiers, each predicting one of the five complication types.
To be accepted by an ML model, a patient’s raw medical data must be transformed into an ML-understandable format (a.k.a. feature engineering) – namely, a feature vector (Fig. 1C⃝). Multiple feature vectors compose a feature matrix with each row describing one patient. Given the feature matrix and the target prediction column, we applied Cardea [alnegheimish2020cardea] – an automated machine learning (AutoML) framework for EHR data – to obtain the best ML model for our task. The framework evaluated 8 classifiers whose hyperparameters were optimized using AutoML for a higher performance score – the AUC averaged over cross-validation folds (see Appendix). Finally, we obtained five models, each of which performed the best for one complication.
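The selection criterion, AUC averaged over cross-validation folds, can be sketched without any ML library. This is not Cardea's API: the function names, the model names, and the fold data below are hypothetical, and the AUC is computed directly from its rank interpretation.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_model(fold_results):
    """Pick the model with the highest mean AUC across its folds.

    fold_results maps a model name to a list of (labels, scores) pairs,
    one pair per cross-validation fold.
    """
    mean_auc = {
        name: mean(auc(y, s) for y, s in folds)
        for name, folds in fold_results.items()
    }
    return max(mean_auc, key=mean_auc.get), mean_auc


# Made-up held-out predictions for two candidate classifiers, two folds each.
fold_results = {
    "model_a": [([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
                ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6])],
    "model_b": [([0, 0, 1, 1], [0.5, 0.6, 0.4, 0.7]),
                ([0, 1, 0, 1], [0.7, 0.4, 0.2, 0.9])],
}
winner, scores = best_model(fold_results)
```

Running the same selection once per complication type would yield one best model per binary classifier, matching the five-model setup described above.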
4.3 Feature Extraction
As shown in Fig. 1A⃝-left, EHR data from different sources is described as different entities, such as Admission (), Lab Test () and Surgery (). Entities are connected by reference keys (Fig. 1A⃝-right). In our running example of surgical complication prediction, we worked with our clinician collaborators and identified six feature types, both static and dynamic, with which to compose a patient’s feature vector. Our target patients are those who underwent cardiac surgery. To that end, we chose Surgery () as the target entity, and extracted the associated features including patient profile from , surgery information from , low-resolution time series (lab test and chart events) from and , and high-resolution time series from .
Assembling patient feature vectors with DFS. We adapted Deep Feature Synthesis (DFS) [kanter2015deep] – an algorithm that automatically generates features out of relational tables – for our scenario to ensure that the connection between a patient’s feature vector and raw records is traceable (R6). It works by following relationships between tables to a base field (e.g., SurgeryId) and then sequentially applying transformation functions along the path to create the final feature. In the end, the algorithm recursively extracts all the features associated with our target entity (Surgery ), and each feature corresponds to a traceable path of length between the final feature value and the raw record(s) of the source entity. The path is important for the purpose of visualization and the identification of influential records (R6, R7).
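The idea of a traceable path can be sketched as follows. This is a simplified stand-in for DFS rather than the algorithm of [kanter2015deep] or any library implementation: it handles a single relationship and two aggregation primitives, and the table contents, field names other than SurgeryId, and the `synthesize` helper are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical child-entity rows that reference the target entity by SurgeryId.
vital_signs = [
    {"record_id": 1, "SurgeryId": "s1", "item": "pulse", "value": 120.0},
    {"record_id": 2, "SurgeryId": "s1", "item": "pulse", "value": 130.0},
    {"record_id": 3, "SurgeryId": "s1", "item": "pulse", "value": 140.0},
    {"record_id": 4, "SurgeryId": "s2", "item": "pulse", "value": 110.0},
]

# Aggregation primitives applied along the relationship.
PRIMITIVES = {"MEAN": mean, "SD": pstdev}


def synthesize(rows, entity_name, target_key, item):
    """Aggregate child records per target row and keep the path back to them."""
    grouped = {}
    for row in rows:
        if row["item"] == item:
            grouped.setdefault(row[target_key], []).append(row)

    features = {}
    for target_id, records in grouped.items():
        values = [r["value"] for r in records]
        for prim_name, fn in PRIMITIVES.items():
            name = f"{prim_name}({entity_name}.{item})"
            features[(target_id, name)] = {
                "value": fn(values),
                # Traceable path: target row -> child entity and item.
                "path": [("Surgery", target_id), (entity_name, item)],
                # Raw records that produced this feature value.
                "record_ids": [r["record_id"] for r in records],
            }
    return features


features = synthesize(vital_signs, "VitalSigns", "SurgeryId", "pulse")
```

Storing the contributing record ids next to each feature value is what later allows a feature to be linked back to the raw records in constant time.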
4.4 Feature Explanation
We applied SHAP values to provide feature-level explanations. However, for features that are unfamiliar to clinicians (e.g., Trends), such explanations are not sufficient. Clinicians wish to further understand which time periods within the records (Fig. 1C⃝) are responsible for the feature of interest (R6, R7).
A common approach is occlusion sensitivity [Zeiler_Fergus_2014]. However, simply removing several medical records and observing how the prediction changes is not a feasible solution, because a surgical patient usually produces thousands of records – meaning that the model will not be sensitive if only a small number of records are removed. A similar approach, observing how relevant features change under the occlusion, has the same sensitivity issues. To solve the underlying sensitivity issues, we first calculate the influence of the records on the relevant features’ values and identify the most influential time periods, using occlusion-based methods introduced below. Then we filter the influential record segments that push the relevant features’ value away from the average level (i.e., the reference value). Consider a scenario in which a patient’s , a major contributing feature, is significantly higher than the reference value. Clinicians may want to know during which period the records specifically cause a sudden increase in feature values, rather than all influential periods.
Computing record influence on a dynamic feature. Given a window size of $w$, a series of temporally ordered records $X$, and a results array $R$ of length $|X|$ initialized to all 0s, we iteratively replace – or “occlude” – segments $X[i : i+w]$ with some set of values, incrementing $i$ by 1 after each step (i.e., sliding window). We use a window of size $w$ to reduce the impact of data quality issues and focus more on a segment of data (R7). We propose the use of a linear curve fit to the points in the window, which maintains smoothness while removing unique features within the window.
After each occlusion step, we recalculate the feature value, and store the change between the original feature value $f$ and the updated feature value $f'$ in the corresponding indices of $R$: $R[i : i+w] \mathrel{+}= f - f'$. The results in $R$ show the relative total influence of each record in $X$, based on how much and in which direction the feature changes when this point is removed. Notably, the real-time computation of $R$ is possible because we store the traceable path between the relevant raw records and the feature value (Sec. 4.3).
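The sliding-window occlusion can be sketched in a few lines. The sketch makes several assumptions that are not stated above: the window advances one record at a time, the stored change is the original feature value minus the occluded one, changes are accumulated over overlapping windows, and the window size is at least 2 so that a line can be fitted. The names `linear_fit` and `record_influence` and the pulse values are hypothetical.

```python
from statistics import pstdev


def linear_fit(segment):
    """Replace a segment by the least-squares straight line through its points."""
    n = len(segment)
    x_bar = (n - 1) / 2
    y_bar = sum(segment) / n
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    slope = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(segment)) / sxx
    return [y_bar + slope * (i - x_bar) for i in range(n)]


def record_influence(records, feature_fn, window):
    """Accumulate, per record, the feature change caused by occluding its windows."""
    original = feature_fn(records)
    influence = [0.0] * len(records)
    for start in range(len(records) - window + 1):
        end = start + window
        # Occlude the window by swapping in its linear fit.
        occluded = records[:start] + linear_fit(records[start:end]) + records[end:]
        change = original - feature_fn(occluded)
        for i in range(start, end):
            influence[i] += change
    return influence


# Made-up in-surgery pulse series with a single spike at index 5.
pulse = [100.0, 101.0, 100.0, 99.0, 100.0, 140.0, 100.0, 101.0, 100.0, 99.0]
influence = record_influence(pulse, pstdev, window=3)
```

For the SD feature used here, the records around the spike receive the largest influence, because smoothing any window that contains the spike lowers the SD the most.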
Identifying the most influential time periods. Now that we have obtained an array of influence values $R$, the next step is to highlight the most influential time periods (R7). This involves finding a threshold and identifying a list of segments with values above that threshold. Given that parametric approaches such as use of a Gaussian tail can be flawed when parametric assumptions are violated (e.g., that the data follow Gaussian distributions), we adopted a non-parametric method without statistical assumptions. This method is adapted from the dynamic threshold computing method proposed by Hundman et al. [hundman2018detecting]. We pick a threshold $\epsilon$ from the set $\boldsymbol{\epsilon} = \mu(R) + \mathbf{z}\,\sigma(R)$, where $\mathbf{z}$ is an ordered set of positive values indicating the number of SD ($\sigma(R)$) above the mean ($\mu(R)$). The optimal $\epsilon$ is determined by:
$$\epsilon = \operatorname*{argmax}_{\epsilon \in \boldsymbol{\epsilon}} \frac{\Delta\mu(R)/\mu(R) + \Delta\sigma(R)/\sigma(R)}{|R_a| + |S|^2}$$
where $\Delta\mu(R) = \mu(R) - \mu(\{r \in R \mid r < \epsilon\})$, $\Delta\sigma(R) = \sigma(R) - \sigma(\{r \in R \mid r < \epsilon\})$, $R_a = \{r \in R \mid r > \epsilon\}$, and $S$ is the set of continuous sequences of $r \in R_a$. The goal is to find a threshold that – once all values above it are eliminated – would lead to the maximum percent decrease in mean and SD of $R$. Then we can obtain $S$, which represents the set of most “exceptionally” influential segments for a feature. We propose a novel visualization for showing this information to clinicians, detailed in Sec. 5.3.
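A minimal sketch of this kind of non-parametric threshold selection is given below. It assumes the scoring objective of Hundman et al. (relative decrease in mean and SD, penalized by the number of flagged values and the squared number of flagged segments), non-negative influence magnitudes, and a hypothetical candidate set of SD multipliers; the adapted method in our system may differ in these details.

```python
from statistics import mean, pstdev


def contiguous_segments(indices):
    """Group sorted indices into (start, end) runs of consecutive positions."""
    segments = []
    for i in indices:
        if segments and i == segments[-1][1] + 1:
            segments[-1][1] = i
        else:
            segments.append([i, i])
    return [tuple(s) for s in segments]


def select_threshold(influence, z_values=(1.0, 1.5, 2.0, 2.5, 3.0)):
    """Pick eps = mu + z * sigma with the best score; return (eps, segments)."""
    mu, sigma = mean(influence), pstdev(influence)
    if mu <= 0 or sigma == 0:
        return None, []  # the relative decreases are undefined in this sketch
    best_score, best_eps, best_segments = float("-inf"), None, []
    for z in z_values:
        eps = mu + z * sigma
        below = [r for r in influence if r <= eps]
        above = [i for i, r in enumerate(influence) if r > eps]
        if not above:
            continue  # nothing would be flagged at this threshold
        delta_mu = mu - mean(below)
        delta_sigma = sigma - pstdev(below)
        segments = contiguous_segments(above)
        score = (delta_mu / mu + delta_sigma / sigma) / (len(above) + len(segments) ** 2)
        if score > best_score:
            best_score, best_eps, best_segments = score, eps, segments
    return best_eps, best_segments


# Made-up influence magnitudes with one clearly influential period.
influence = [0.1, 0.2, 0.1, 0.15, 3.0, 3.2, 0.1, 0.2, 0.1, 0.12, 0.1, 0.15]
eps, segments = select_threshold(influence)
```

The returned segments are index ranges into the record series, so they can be mapped directly to time periods for highlighting.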
In this section, we continue with our running example – surgical complication prediction – to introduce our system designs.
5.1 System Overview
Fig. 2 illustrates the system architecture and the interactive analysis pipeline it supports. The system comprises four major modules: (1) storage, (2) analysis, (3) explainer, and (4) interface. The storage module saves all original patient records, a feature matrix in which each row represents a patient’s clinical features, and the ML prediction results. The analysis module supports dynamic calculation of reference values when a cohort of patients is selected, and real-time computation of the results for what-if analysis. The explainer module uses SHAP values to represent feature contributions, and identifies influential time periods for a given feature. Lastly, the interface module supplies multiple visual views, allowing clinicians to carry out their analysis using either a forward or backward workflow.
To show how the five views in the interface are connected, we assume that a clinician is using the backward analysis workflow to investigate the risk that a patient will have postoperative complications. First, he picks the patient and complication of interest from the top menu (interface overview figure, A⃝). The five icons beside the selection box show the prediction results for the five types of complications: orange for positive and blue for negative. Next, he views the patient’s demographic, surgery, and admission information through the Profile View, identifies a cohort of patients as the reference patient group using the Filter View, and checks the number of patients in the cohort on the top menu (interface overview figure).
Next, he begins the formal investigation from the Feature View, which hierarchically shows ML features, their contributions to the prediction result, and their reference values (1, 2, 3). To further investigate the features of interest, he can check the original records of these features using the Temporal View; this view visualizes how a patient’s clinical records change over time, along with the calculated reference range (5) and the influential periods (7). Multiple types of visual association, such as linking, filtering, and highlighting, are enabled to support the back-and-forth analysis between the Feature View and the Temporal View (6). If the original records of the target feature (e.g., Heart Rate) are not sufficient to verify the hypothesis, the clinician can then refer to the Timeline View to understand the overall situation and select contemporaneous medical records from other relevant features for further investigation (4).
5.2 Feature View
The Feature View (interface overview figure, D⃝) aims to allow clinicians to explore and understand the model’s behavior at the feature level. To provide a more consistent representation of this large number of features, we have grouped relevant features according to suggestions from our clinician collaborators (1). We first group the features (e.g., Mean, SD, and Trend) that were extracted from the same series of medical events (e.g., Pulse Records). We further divide the features (or groups) according to their temporal occurrence, which includes “pre-surgery” features (e.g., demographics) and “in-surgery” features (e.g., vital signs).
Hierarchical display (1). We visualize the features in a hierarchical list where each row represents a feature or a feature group (Fig. 3). For each feature, we present the value and its contribution to the model prediction. We visually encode the contribution value with a horizontal bar, where the color encodes its sign (red for increasing complication risks and blue for decreasing risks) and the length encodes its magnitude (Fig. 3A⃝). For a feature group, we calculate group-level contributions by summing up the included features’ additive contributions.
Based on the definition of Shapley values [shapley1953value], group-level contributions can be interpreted as an approximation of the effects of removing this group of features from the model. Because clinicians have different goals or levels of knowledge, some expect to investigate the most fine-grained level of features (e.g., SD and Trend) while others may stop at the group level. The hierarchical feature list accommodates both needs. Sorting and filtering by contributions are also supported to offer clinicians more control during exploration (1).
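The group-level aggregation described above amounts to summing additive contributions along the feature hierarchy. The sketch below illustrates this with a two-level hierarchy; the function names, the feature names, and the contribution values are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_contributions(
    contributions: Dict[str, float], groups: Dict[str, str]
) -> Dict[str, float]:
    """Sum additive contributions of all members of each group."""
    totals: Dict[str, float] = defaultdict(float)
    for name, value in contributions.items():
        totals[groups[name]] += value
    return dict(totals)


def top_k(contributions: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Sort by absolute contribution and keep the k largest entries."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


# Hypothetical per-feature contributions (e.g., SHAP values).
shap_values = {"Pulse Mean": 0.25, "Pulse SD": 0.125, "Pulse Trend": -0.125, "Age": -0.5}

# Level 1: features grouped by the series of records they were extracted from.
series_of = {"Pulse Mean": "Pulse", "Pulse SD": "Pulse", "Pulse Trend": "Pulse",
             "Age": "Demographics"}
by_series = group_contributions(shap_values, series_of)

# Level 2: series grouped by temporal occurrence.
period_of = {"Pulse": "in-surgery", "Demographics": "pre-surgery"}
by_period = group_contributions(by_series, period_of)
```

Because the contributions are additive, the same function can be reused at every level of the hierarchy, and `top_k` mirrors the sorting and filtering controls at whichever level is displayed.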
References from cohorts (2). In our system, reference values are calculated from a relevant cohort (e.g., patients in the same age range) selected by users through the Filter View. The selected cohort is further divided into a low-risk group (i.e., no complications) and a high-risk group (i.e., one or more complications). We use the 95% CI of the low-risk group’s mean value as the reference value range. We use an upward/downward arrow to indicate whether a value is beyond the upper/lower bound of the reference range (Fig. 3B⃝).
Clinicians can click the value area to inspect detailed value distributions of the low-risk group and the high-risk group (Fig. 3C⃝). For a continuous feature, the distribution is visualized with area charts, where a red line indicates the position of the target patient’s feature value. For a categorical feature, we use bar charts to depict the distribution rather than area charts.
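As a rough illustration of the cohort-based reference range, the following sketch splits a cohort into low-risk and high-risk groups and computes a 95% CI of the low-risk group's mean. The normal approximation (multiplier 1.96), the function names, and the cohort values are assumptions for the example; the text does not specify how the interval is computed.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def split_cohort(
    cohort: Sequence[Tuple[float, int]]
) -> Tuple[List[float], List[float]]:
    """Split (feature value, number of complications) pairs into
    low-risk (no complications) and high-risk (one or more) values."""
    low = [value for value, n in cohort if n == 0]
    high = [value for value, n in cohort if n > 0]
    return low, high


def reference_range(low_risk_values: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """95% CI of the mean, using a normal approximation (an assumption)."""
    m = mean(low_risk_values)
    half_width = z * stdev(low_risk_values) / sqrt(len(low_risk_values))
    return m - half_width, m + half_width


def flag(value: float, reference: Tuple[float, float]) -> str:
    """Mirror the upward/downward arrow: is the value beyond a bound?"""
    low, high = reference
    if value > high:
        return "above"
    if value < low:
        return "below"
    return "within"


# Hypothetical cohort: (feature value, number of complications).
cohort = [(10, 0), (12, 0), (11, 0), (13, 0), (9, 0), (11, 0), (12, 0), (10, 0),
          (15, 1), (16, 2)]
low_risk, high_risk = split_cohort(cohort)
reference = reference_range(low_risk)
```

Recomputing `reference` whenever the selected cohort changes is what makes the reference values dynamic in this sketch.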
What-if analysis (3). In evidence-based clinical practice, clinicians pay close attention to anomalous records (e.g., low Oxygen Saturation Rate) in the process of clinical reasoning. Our system marks values outside the reference range as anomalies. Clinicians are particularly interested in highly contributing features with anomalous values. For example, in Fig. 3B⃝, the surgery time (296 minutes) is marked as an exceptionally high value by the upward arrow, and it also makes the highest contribution to the prediction. Our clinician collaborators had expressed strong interest in such cases, leading us to ask: if such a value were normal, would it still make a large contribution?
To answer this question, we designed a reference-value-based what-if analysis technique. Unlike open-ended what-if analysis techniques [wexler2019if], we focus on one abnormal feature at a time, and make a minimal change to fit the reference range (e.g., setting a high blood-pressure-related feature value to the upper bound of its reference range). Then we calculate and visualize changes in the prediction result and the target feature’s contribution (Fig. 3C⃝ - bottom right). We designed visualizations to encode the contribution changes while preserving the original contribution (i.e., solid and dashed area) as context (Fig. 3D⃝). This approach provides clinicians with an efficient and familiar way to verify their findings, especially when they are not well-practiced in setting feature values.
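The reference-value-based what-if step can be sketched as follows. The toy logistic model, its weights and baselines, and the reference range of 150 to 240 minutes are hypothetical stand-ins; in particular, the per-feature contribution `weight * (value - baseline)` is only a simple substitute for the SHAP values the system uses. Only the surgery time of 296 minutes comes from the example in the text.

```python
import math
from typing import Dict, Tuple

# Hypothetical toy model (not the paper's model).
WEIGHTS = {"surgery_time": 0.01, "lactate_mean": 0.5}
BASELINE = {"surgery_time": 180.0, "lactate_mean": 1.5}
BIAS = -1.0


def contribution(features: Dict[str, float], name: str) -> float:
    """Stand-in for a SHAP value: additive term of one feature in log-odds."""
    return WEIGHTS[name] * (features[name] - BASELINE[name])


def predict(features: Dict[str, float]) -> float:
    """Predicted complication risk from the toy logistic model."""
    logit = BIAS + sum(contribution(features, name) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-logit))


def what_if(
    features: Dict[str, float], name: str, reference: Tuple[float, float]
) -> Dict[str, float]:
    """Move one abnormal feature to the nearest bound of its reference
    range and report the change in prediction and contribution."""
    low, high = reference
    changed = dict(features)
    changed[name] = min(max(features[name], low), high)  # minimal change
    return {
        "new_value": changed[name],
        "prediction_before": predict(features),
        "prediction_after": predict(changed),
        "contribution_before": contribution(features, name),
        "contribution_after": contribution(changed, name),
    }


patient = {"surgery_time": 296.0, "lactate_mean": 4.0}
result = what_if(patient, "surgery_time", reference=(150.0, 240.0))
```

Clamping to the nearest bound is what keeps the change minimal: a value already inside the range is left untouched, so the analysis only ever alters features flagged as anomalous.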
5.3 Temporal View
The Temporal View visualizes a list of time series, each representing a type of time-varying clinical feature (e.g., Heart Rate), in order to provide context for feature-level explanations (interface overview figure, E⃝). When clinicians find interesting features in the Feature View (e.g., Mean of Oxygen Saturation), they can append the corresponding time series records to the Temporal View for further inspection (5).
Each time series is visualized as a line chart (Fig. 4B⃝). We use a translucent blue area with a horizontal line in the middle to show the reference range (i.e., 95% CI) and the mean value from the selected patient group. This design is familiar to clinicians and has frequently been used in clinical research [2012SimTwentyFive]. In this paper’s running example, children’s observed values (e.g., Pulse) vary significantly in different situations. In response, we compute reference values dynamically according to the selected group of patients, as with the feature reference values (Sec. 5.2). We further extend the design to support the analysis of anomalies, concurrent patterns, and influential segments.
Highlighting anomalous records and enabling concurrent patterns analysis. Inside the line chart (Fig. 4B⃝), we use red dots and line segments to highlight out-of-reference-range records and time periods. To support clinicians inspecting multiple time series at the same time for concurrent pattern analysis (interface overview figure), we use a space-efficient design that shows only the out-of-reference-range segments (Fig. 4A⃝). The arrow direction indicates whether a point (segment) is above or below the reference value range, a design consistent with the one used in the Feature View (Fig. 3B⃝).
Highlighting influential value patterns. Reference values provide clinicians with an evidence-based method for insight verification. However, clinicians are also curious to see how an ML model judges the influence of certain time periods, such as high-risk periods captured by the model (7). We use the algorithm described in Sec. 4.4 to identify the most influential non-overlapping time segments and highlight them in the line chart (Fig. 4E⃝). However, multiple features (Mean, SD, Trend, etc.) may be associated with the same series of records (e.g., Pulse), so segments that are influential to different features can overlap. These overlapped areas often suggest highly influential time periods because they contribute to many features simultaneously. Inspired by Kim et al. [kim2021towards], we consider three design alternatives (Fig. 4C⃝-E⃝) for highlighting prominent regions in the line chart. For C⃝, the bordered bounding box is accurate and clean, but not efficient at highlighting the overlapped area. For D⃝, the translucent full-height box highlights the overlap well, but is visually crowded. In consultation with our clinician collaborators, we finally chose the last design E⃝, which combines the advantages of the other two designs.
5.4 Timeline View
The Timeline View (interface overview figure, C⃝) provides an overview of the target patient’s health records (4). This view is the starting point for clinicians who use the forward analysis workflow (3.3). It is also an indispensable part of the backward analysis workflow (3.3), when a clinician wants more contextual information about the patient. Through this view, clinicians can move additional medical records into the Temporal View for comparative analysis.
We use a matrix-based visualization [weng2021towards] to show a summary of the target patient’s medical events from different sources (lab tests, vital signs, and chart events) (interface overview figure, C⃝). The horizontal timeline is divided into predefined, equal time intervals (e.g., 1h, 4h, and 8h). Each cell contains the two pieces of information our clinician collaborators deemed most vital: (1) the background color encodes the number of events, with darker blue representing more events; and (2) the width of the inner box encodes the proportion of events containing out-of-reference-range values. For example, a light-blue cell with a narrow inner box indicates that very few events occurred during this period and that most of them were normal, while a dark-blue cell with a wide inner box has the opposite meaning, and may call for an in-depth inspection. A similar design was used in Voila [cao2017voila] to visualize the number of anomalous events of a region on a map.
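The two quantities encoded in each cell of the matrix can be computed with a simple binning pass over one row's events. The sketch below is illustrative only: the `Cell` type, the function name, and the event representation (hours since the start of the timeline plus an out-of-range flag) are assumptions rather than the system's data model.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Cell(NamedTuple):
    count: int             # drives the background color (darker = more events)
    abnormal_ratio: float  # drives the width of the inner box


def timeline_cells(
    events: Sequence[Tuple[float, bool]],
    interval_hours: float,
    n_cells: int,
) -> List[Cell]:
    """Bin (hours since timeline start, is_out_of_range) events into
    equal intervals and summarize each interval as a Cell."""
    counts = [0] * n_cells
    abnormal = [0] * n_cells
    for hours, is_out_of_range in events:
        index = int(hours // interval_hours)
        if 0 <= index < n_cells:  # ignore events outside the timeline
            counts[index] += 1
            abnormal[index] += int(is_out_of_range)
    return [Cell(c, a / c if c else 0.0) for c, a in zip(counts, abnormal)]


# Hypothetical row of lab-test events.
lab_events = [(0.5, False), (1.5, True), (1.7, True), (5.0, False)]
cells = timeline_cells(lab_events, interval_hours=1.0, n_cells=4)
```

Changing `interval_hours` (e.g., from 1 to 4 or 8) re-bins the same events at a coarser granularity without touching the raw records.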
Observing interesting cells in a particular row (e.g., lab tests), clinicians can brush to select them, and click the “Go Temporal View” button to visualize all records from different items (e.g., lab test items such as ALT, Glucose, and Lactate) in the Temporal View for a detailed investigation and comparative analysis.
In addition to the basic interactions introduced above, the system offers two additional interactions, linking and marking, to facilitate better visual associations between features and their corresponding records.
Visually associating features and medical records (6). Understanding connections between the feature elements (i.e., rows in the feature list) and medical record elements (e.g., temporal records in line charts and static information listed in the patient’s profile) is not easy. Clinicians may need to scroll through a long list of features and compare names one by one. To make this easier, we propose the following novel and intuitive strategy. First, we use small colored bars to indicate the data source (e.g., lab tests and vital signs) for both feature elements (on the right border) and medical record elements (on the left border). Then we draw curves to connect the associated feature elements with medical record elements (interface overview figure, E⃝2). These curves are dynamically updated when users scroll down or add additional time-series records to the Temporal View.
Marking on medical records. To support the forward analysis workflow, clinicians can mark interesting medical record elements with “pins” (interface overview figure, E⃝1). Associated feature elements are highlighted with a thicker bar. Clinicians can temporarily hide all other irrelevant features or feature groups by turning on the “focus” switch in the Feature View’s top-left corner.
In this section, we first introduce two case studies conducted with two clinicians (P1, P5) to evaluate whether the system and our proposed workflows (3.3, 3.3) can support clinical decision-making. Both clinicians also participated in the pilot study and development process, and are therefore familiar with the system.
6.1 Case Study I - Backward Analysis
We worked with P5, who has 10 years of work experience, to explore the system and the model’s predictions about a two-month-old infant admitted to the CICU. The patient was predicted to be at high risk for various complications (L, C, A). The clinician was most interested in predicting cardiac complications (C), since they can lead to severe consequences. He first selected a group of patients in the same age range (i.e., infants from 28 days to 12 months) to serve as references (2, 5). This group included 869 patients (interface overview figure), of whom 550 were healthy.
Exploring feature hierarchy (1). The clinician first glanced at the patient’s profile and noticed that two features, surgery time and CPB (cardiopulmonary bypass) time, were much higher than usual. Keeping this in mind, he started exploring the Feature View to check the features’ contributions to the predicted complications. In the top level of the feature hierarchy, he noticed that the contribution bar of the “in-surgery” feature group was much longer than that of the “pre-surgery” feature group, which means that the model mainly used information collected during surgery to make the prediction. The clinician then expanded the feature hierarchy and zoomed into a lower level to inspect the detailed explanations. Through sorting and filtering, he settled on a configuration where only the top 5 features or groups with the highest contributions were displayed in the list (1). “I like this control function and it helps me narrow down to a more focused display with only a few most important features”, he commented.
Understanding feature contributions (2). He then noticed that the CPB time and the surgery time were the two most important features, and that both values were out of distribution. He commented, “This is exactly what I expected. Great to have a confirmation from the model about my previous suspicion”. He further wondered, “What would happen if their values go back to normal?” We reminded him of the what-if analysis function (3). Using this function, he found no noticeable change in the prediction result for either feature. However, he noticed that reducing the surgical time to the normal range decreased its contribution significantly. He reasoned, “The exceptionally long surgical time makes this feature positively contribute a lot to the model prediction, but other factors are still playing important roles because the prediction result does not change after what-if”.
The clinician then moved on to two other features of interest, Oxygen Saturation and Lactate, as they are critical indicators of a patient’s condition. Zooming into the most fine-grained feature level, he discovered that the contributions of these two features mostly came from the Mean features (other features such as SD and Trend were filtered out due to their insignificant contributions), which were either exceptionally low (Oxygen) or high (Lactate). He suspected such abnormal values should have considerable impacts on the model prediction and confirmed this suspicion with the what-if analysis. He then showed further curiosity about the details of these abnormal values and commented, “I have to figure out when and why the lactate/oxygen saturation started to accumulate/drop. This is important for me to understand which catalysts, such as a patient’s pre-existing condition or a surgeon’s mistake, cause the results.” So he selected the corresponding features to review them in the Temporal View (6).
Inspecting features’ influential records (5). After the Temporal View was displayed, he immediately observed that the Lactate level was normal at the beginning of surgery, but started to increase after 2:00 PM and eventually went above the reference range after 3:00 PM. The Oxygen Saturation, however, was below the reference range during almost the entire surgical period (all the red downward-facing arrows). He commented, “I am so impressed by the smooth interaction and intuitive visualization design to guide me here. I think this patient might have cyanotic congenital heart disease, which could be the root cause for the hypoxemia and the lactate accumulation.” He then decided to continue exploring the direct reason for the Lactate accumulation. He hypothesized that such accumulation was directly caused by the CPB process (CPB is a technique that temporarily takes over the function of the heart and lungs during surgery, maintaining the circulation of blood and oxygen). To confirm this, he referred to the Timeline View and selected the vital signs during the surgery as references (4).
Taking a close look at the Pulse records, he noticed that the Pulse dropped to a very low level (50 BPM) at 3:00 PM and returned to normal at 5:00 PM. He confirmed this was the CPB period and explained, “During this time, the functions of the patient’s heart and lungs were taken over by the CPB pump. That’s why the patient’s pulse looks abnormal.” Comparing this period with the Lactate curve, he then rejected his earlier hypothesis, because the lactate had already reached a high level at 3:11 PM, in which case the accumulation would have started earlier. Another interesting pattern, a sudden drop of Pulse around 2:30 PM, caught his attention. He thought, “This was a rescue conducted at that time and is likely to be the key reason accounting for lactate accumulation”.
Noticing the sudden drop in Pulse, he was curious about whether the model “captured” this information while making predictions (7). He then clicked the “explain” button to toggle the influential segments from the model’s point of view. He noticed that most of the orange (influential) areas covered the CPB period. This fine-grained explanation was slightly different from his expectation: from his perspective, the model should also have paid attention to the earlier sudden drop in Pulse. But in general, he agreed that the prediction was based on the most potentially critical medical records, and was trustworthy.
Summary. Through this exploration, P5 was able to understand the most important features that led to the prediction, and to explore some interesting features and their corresponding records in depth. He decided to pay more attention to this patient, and considered using proactive treatments to avoid the situation getting worse in postoperative care.
6.2 Case Study II - Forward Analysis
We worked with P1 – who has 20 years of experience in this field – to understand a prediction of high-risk lung-related complications made by the machine learning model.
Gaining an overview of patient information (4). The clinician started by checking the patient Profile View. She thought everything (e.g., surgical time and CPB time) was normal except for the patient’s age (11 months), which was young for a VSD repair surgery. Then she looked at the Timeline View and found the period during surgery (Fig. 5-1). In the row of lab tests, she noticed that most in-surgery test results were in normal ranges, indicated by the small grey inner rectangle. At the same time, vital signs had a slightly higher proportion of abnormal records. After the initial exploration, she found no solid evidence to indicate complication risks.
Inspecting record details (5). She then checked the detailed lab tests and vital signs (Fig. 5-2). She commented, “I don’t find any big things. The three important indicators, Oxygen Saturation, Pulse, and Lactate, all look clean with no anomalous segments.” She also noticed End-Tidal CO2 was below the reference range for a long period. Nevertheless, she hypothesized that the patient was not likely to have complications, which contradicted the model prediction. So she planned to refer to model explanations to figure out whether there were factors she had overlooked. She marked all four items (Fig. 5-3) and continued to check the explanations in the Feature View.
Comparing feature contributions with expectations. By tracing the links to the feature list (6), she noticed that the feature group related to End-Tidal CO2 had a high positive contribution to the high-risk prediction (Fig. 5-4). In contrast, features related to the other three items had slight negative contributions. She praised the explanation: “The explanation algorithm looks amazing. This actually matches what I expected. Now I am curious to see what the influential periods the model thinks to be”. She clicked the “explain” button for help and then obtained the orange-highlighted area (Fig. 5-5), which she thought was caused by CPB. The overlapped area with a deeper color also caught her attention, because multiple sub-features identified this area. She then said, “This is the critical changing point, but I might need more contextual information to test my thoughts”.
She also noticed that Systolic Blood Pressure, Carboxyhemoglobin (COHb), and pre-surgery Red Cell Distribution Width (RDW) had the highest contributions. Among these, she noticed that the mean values of COHb (Fig. 5-6) and RDW (Fig. 5-7) were higher than the reference range. She commented, “This is beyond my expectation. I know COHb is used to detect carbon monoxide (CO) toxicosis, but I never use this to judge whether a patient will develop complications”. Through further inspection (Fig. 5-8), she found that the COHb level was the highest right after the abnormal segment of End-Tidal CO2. She thought, “This might be unnoticeable factor in identifying the complications and I want to do further study with my team about it”. As for the high RDW level, she realized that it might indicate that the patient suffered from iron-deficiency anemia, making them vulnerable to (lung) infections. This lab test does not tend to draw much attention from cardiac surgeons, so she had missed it earlier.
Summary. After the exploration, P1 agreed that the patient was likely to have lung-related complications and decided to pay more attention to this patient. She was also curious about how COHb can be used to identify complications and considered studying it further.
These case studies suggest that the system is helpful to clinicians and can support them in their decision-making. In addition to the case studies, we conducted semi-structured interviews with P4 and P6, showing them the case study results and encouraging them to freely explore the system, to collect additional feedback. We report and discuss feedback from all four clinicians as follows.
7.1 Design Implications
Feedback from these four clinicians led us to a set of important design considerations for all such projects, which we summarize as follows:
Applications of the system. All four participants generally commended the usefulness of the system in supporting diagnoses and expected to use it to improve their daily workflow. P1 expected to use the system to make more accurate decisions. She said, “Everyone sometimes may fall into blind spot. This tool can actually help me reduce the risk of making mistakes”. P4 expected to use the system to communicate better with other clinicians. He commented that “A surgery involves collaborations between teams, [...] people see data from different angles which might be biased somehow, [...] I would trust [the system] and believe it can greatly facilitate the communication between teams”. Both P5 and P6 suggested using the system to help junior doctors make more accurate diagnoses.
Reference-value-based explanations. The reference values are vital in facilitating prediction interpretations for clinicians. Explanations like “the feature, whose value is below the reference range, has a high contribution to the patient’s cardiac complication” are easier for clinicians to understand and accept than contribution scores reported on their own, as confirmed by P5.
Feature hierarchy design. The hierarchical display of features was praised by all participants as it helps them avoid unnecessary details during exploration. Currently, there is no standard for designing the hierarchy of all healthcare features. However, ideas can be borrowed from the clinical forms used for communications between clinicians as suggested by P1.
Providing explanations with context. As demonstrated by the case studies, contextual information helps clinicians to understand explanations. Those in our study appreciated how the various visualization and interaction techniques in the system facilitated visual association between explanations and context. “With the links, I can easily get connections between the features with their corresponding results”, as P4 said. Also, P1 suggested that “marking” is a very convenient interaction for checking explanations at will.
7.2 Limitations and Future Work
We introduce the limitations of our current work and future plans.
Feature interpretability. Our system only focuses on explaining predictions made from interpretable features (i.e., features that have clear meanings and are extracted from a series of relevant health records). When the feature itself is hard for humans to understand (e.g., features built from representation learning methods), the connections between features and health records can be very complex. In this case, the system will be less effective. An advanced method for tracing and storing such complex connections would be a good addition and remains to be explored.
Potential cognitive biases. Wang et al.’s work [wang2019designing] suggests that a backward-oriented reasoning process (i.e., first acquiring the diagnostic predictions, and then looking for supporting evidence) may lead to confirmation bias. Potential effects of cognitive biases on clinicians’ decision-making when following different analysis workflows, and how our visualization designs may alleviate potential risks, have not been fully evaluated in this work. We plan to study this further by assessing the precision of clinicians’ decisions when using the system.
Quality of EHR data. The poor quality of EHR data (e.g., missing data) is a challenge to EHR data analysis in general. During the system’s development process, we also found many “False Positive” patterns caused by misrecorded data items (e.g., a seeming cardiac arrest pattern was traced back to a faulty sensor). Currently, clinicians’ prior knowledge is required to detect these data defects. In the future, we plan to investigate anomaly detection and visualization solutions to detect and encode any missing information in order to raise clinicians’ awareness of missing data.
Precision of reference values. To improve the usability and precision of the dynamic reference value selection method, we plan to make the following extensions. First, we will automatically recommend relevant cohorts to clinicians for obtaining reference values. Second, we will derive time-varying reference values for temporal records (e.g., Pulse), which are more applicable to surgical scenarios that are composed of multiple stages. Third, we will conduct experiments to understand the stability of the reference values (i.e., how will the reference values change as the cohort changes over time?).
Visual scalability. Scalability issues occur in the temporal view when analyzing a signal with a large number of records. In addition, as the number of test items (rows) increases, finding interesting ones becomes less efficient, as more scrolling is required. In the future, we plan to scale up our approach by 1) segmenting long signals in different scales and 2) using searching and filtering techniques to facilitate the exploration of a vast number of complex signals.
Generalizability to other healthcare models. The system can be generalized to work on other prediction problems (e.g., mortality predictions) and other ML models using the PIC dataset. However, adaptations (e.g., formal descriptions of the entities and generated features) must be made to use the system with other EHR datasets (e.g., MIMIC-III [johnson2016mimic]), as required by the feature extraction process (introduced in Sec. 4.3). In the future, we plan to improve generalizability by defining system inputs according to the Fast Healthcare Interoperability Resources (FHIR) standard [FHIR], a general EHR data format.
In this work, we identified three key challenges limiting the use of ML in clinical settings: clinicians’ unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence. We then introduced a visual analytics system, designed according to the requirements identified in a pilot study, to support clinicians using ML to make decisions with both forward and backward analysis workflows. We conducted two case studies and expert interviews with four clinicians. Their positive feedback and in-depth insights demonstrate the usefulness and effectiveness of the system. In particular, they reveal that visually associating model explanations with patients’ situational records can help clinicians better interpret model predictions and use them to make clinical decisions.