1 Related Work
We survey relevant work on ML-based prediction on clinical data, interpretation methods for ML models, visualizations for XAI, and consistency measurements of prediction models.
1.1 Machine-Learning-Based Prediction on Clinical Data
For medical predictions on clinical data, two major categories of ML methods are commonly used: tree-based methods and recurrent neural network (RNN)-based methods.
Tree-based methods are appropriate for medical prediction tasks because they characterize decision-making in terms of the attribute values encountered for each patient. In addition, many clinical datasets take the form of temporal sequences. In this setting, tree-based methods are used to forecast a result at a certain future time step based on the results in past time steps. Since the introduction of the decision tree [37], much research has been carried out to enhance this method. Random forest [47] uses an ensemble of trees to reduce overfitting. Gradient boosting [14] extends the idea of random forest by boosting an ensemble of weak learners. Further, a series of variations [5, 22, 12] of gradient boosting have been developed to provide more efficient computation and better performance.
RNN-based methods are designed for temporal classification and can thus be used to make predictions on clinical data. For example, Lipton et al. [27] used an RNN with long short-term memory to classify diagnoses from patients’ temporal sequences. Che et al. [4] further enhanced RNNs to handle data with missing values. While deep learning methods can provide high prediction performance, they typically need a large dataset (e.g., more than tens of thousands of records) for training.
1.2 Interpretation Methods for Tree-Based and Deep-Learning-Based Models
Various interpretable ML and post-hoc analysis methods have been developed. Adadi and Berrada [1] provided a comprehensive survey of interpretation methods. Here, we describe only the methods related to tree-based or deep-learning-based models.
For tree-based models, interpretation methods can be categorized by scale: a global scale, which investigates each feature’s impact on all instances’ predictions (global feature contributions), and a local or instance scale, which describes the contribution of each feature to each instance’s prediction (local feature contributions). There are several ways to calculate feature contributions. For example, Palczewska et al. [34] introduced a method that calculates the local increment in value from a parent node to a child node and attributes it to the feature used for that split. Another example is the TreeInterpreter [43], which obtains each feature’s contributions by going back through each decision path from a leaf to its root.
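As a minimal illustration of path-based local contributions, the following Python sketch walks a hand-built toy tree and credits each change in node value to the feature tested at that split. It is written in the spirit of the methods above, not as the exact procedure of [34] or [43]; the tree structure, the node values, and the function name `local_contributions` are assumptions made for this example.

```python
# Toy decision tree as nested dicts; every node stores the mean training
# label ("value") of the samples that reach it. All numbers are made up.
TREE = {
    "value": 0.30, "feature": "age", "threshold": 60,
    "left": {"value": 0.10},                      # age <= 60
    "right": {                                    # age > 60
        "value": 0.50, "feature": "los", "threshold": 7,
        "left": {"value": 0.40},                  # los <= 7
        "right": {"value": 0.80},                 # los > 7
    },
}


def local_contributions(tree, record):
    """Return (bias, contributions, prediction) for one record.

    Walking from the root to a leaf, the change in node value at each
    split is credited to the feature that the split tests.
    """
    node = tree
    bias = node["value"]
    contributions = {}
    while "feature" in node:
        feature = node["feature"]
        branch = "left" if record[feature] <= node["threshold"] else "right"
        child = node[branch]
        contributions[feature] = (
            contributions.get(feature, 0.0) + child["value"] - node["value"]
        )
        node = child
    return bias, contributions, node["value"]


bias, contribs, prediction = local_contributions(TREE, {"age": 72, "los": 10})
```

In this toy case the root value plus the per-feature contributions adds up to the leaf value, which is the property that makes such contributions readable as a decomposition of a single prediction.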
As for deep learning used for medical predictions, Choi et al. [6] built a variation of RNN called RETAIN. By integrating the attention mechanism into an RNN, RETAIN provides interpretable results while preserving the prediction performance of the RNN. Also, Kwon et al. [26] modified RETAIN so that it provides the attention value of each feature at each time step.
Besides interpretation methods for specific models, there are also model-agnostic methods. Model-agnostic methods produce generic model explanations by relying on components common to different models. LIME [41] is an example of an interpreter for any type of model. Another example is SHAP [31], which presents the impact of each feature value on the prediction. For a more comprehensive list of model-agnostic methods, refer to the survey by Adadi and Berrada [1].
1.3 Visualizations for XAI
Liu et al. [28] surveyed the recent progress on visualizations developed to understand, diagnose, and refine ML models. For example, Manifold [51] is a framework for visually interpreting, debugging, and comparing ML models. For deep neural networks, Wang et al. [50] developed an interpretation approach to review the inner mechanisms of complicated neural networks. RuleMatrix [32] provides rule-based explanations that allow users with little ML knowledge to navigate and validate ML models. As described in [28], there are many more related works; in the following, we focus on tree-based and RNN-based models, which are commonly used for clinical data.
Several works provided visual analytics methods for tree-based models. For example, Zhao et al. [52] developed a comprehensive interface for interpreting random forests. Their visual analytics system focuses on model simplification and decision path extraction for a selected group of patients. Another example is TreePOD [33], which was developed to aid decision tree selection through visual exploration of candidate trees. Collaris et al. [8] performed a case study on instance-level visual reasoning about random forests in a fraud detection scenario.
As for RNN-based methods, Jin et al. [19] built a clinical decision assistance system with RETAIN. For one selected patient, their system shows which past events have a strong influence on the ML decisions, potential treatment outcomes, and a summary of similar patients’ health records. This information helps clinicians make their decisions with confidence. Guo et al. [17] created a scalable interface to aggregate event sequence records of patients based on the RNN model they devised. Wang et al. [49] produced a matrix of small multiples to visually reason about feature attributes (i.e., attention values of their RNN model) as well as a time sequence view for making comparisons.
Although numerous visual analytics methods have been developed [28], methods for tree-based models have not yet been fully studied. More specifically, methods for comparing multiple models based on their interpretable information (e.g., local feature contributions) are still missing.
1.4 Consistency Measures of Model Interpretations
On the most generic level, a machine learning algorithm’s consistency can be defined as its stability under small perturbations of the input data. The stability of learning algorithms was first investigated by Devroye and Wagner [11], who observed the quantitative results of leave-one-out error. Since then, much work has been done on the probabilistic analysis of learning algorithms’ stability [10, 30, 23]. Then, Bousquet and Elisseeff [2] defined the notion of uniform hypothesis stability to derive generalization error bounds.
In addition to the probabilistic analyses of stability, statisticians also developed definitions and theories of learning algorithms’ consistency. According to the definition in [24], an algorithm is consistent if it always returns a concept that is consistent with the given examples. Then, the definition of the PAC (Probably Approximately Correct) learning algorithm [18], which is an extension of consistent learners, was introduced.
In our visual analytics scenario, we are interested in the consistency of each ML method’s inner rationales. Therefore, besides the stability or consistency of the learning algorithm, we also survey measures of dependency between two random variables, which characterize the dependence of an ML model’s rationales (e.g., local feature contributions) on the actual feature values. To begin with, Pearson’s correlation coefficient provides a measure of the linear relationship between two variables [35]. Later on, Spearman [46] extended Pearson’s correlation to nonlinear relationships. However, Spearman’s correlation coefficient is limited to monotonic dependencies between variables. With the development of information theory, measures that summarize the information shared between two random variables have been introduced, for example, mutual information [9], maximal information coefficient [38], and total information coefficient [40]. None of these measures makes assumptions about the random variables’ distributions and, thus, they are more suitable when the distributions are unknown and the relationships are nonlinear.
2 Analysis Questions (AQs)
As mentioned in the previous sections, the general goal of this study is to leverage visual analytics when reasoning about and comparing multiple models’ interpretations. Here, we list more detailed analysis questions that we want to answer with our visual analytics system. These questions have led us to design the methods described in Sec. 3.
 AQ1

Do multiple models (even trained with similar ML methods) have different internal criteria for predictions and how different are they?
An answer to this question will show the importance of interpretations and comparisons of multiple models. For example, the gradient boosting method and its variations have the same theoretical background. We want to first know whether these methods have significant differences in their prediction criteria. Afterwards, we can move on to the next level of analysis.
 AQ2

Which model likely has higher consistency in its prediction criteria and should consequently be more trusted? Furthermore, which ranges of which features are more reliable within a model’s prediction criteria?
After observing and understanding the differences among models, we want to know which models should be chosen. We should not rely on a model that radically changes its predictions when input feature values are slightly changed (i.e., a model whose decisions are inconsistent). To find a more reliable model, our system should help answer this question.
After getting an overview of different models’ consistency, users would like to know the details of a model’s consistency for each individual feature, so that they can further select the reliable features to trust when viewing the predictions. Moreover, even within one feature, the consistency could vary across ranges of feature values. For example, a model could have high consistency in its predictions for patients who are over 60 while having low consistency for younger patients. Therefore, our system should also support these analyses.
3 Methodology
We describe our dataset, prediction task, ML methods, interpretation measures, and consistency measures of interpretations. These are used in our visual analytics system, as described in Sec. 4.
3.1 Data and a Prediction Task
We use the MIMIC-III dataset [20], which is a large, open-access clinical database composed of de-identified critical care unit admission records for over 40,000 patients. The dataset includes demographics, vital sign measurements, admission information, test results, medications, procedures, and mortality.
From this dataset, our prediction task is to foresee the chance of in-hospital mortality, given a patient’s current and previous admission records. This large database covers patients with more than 14,000 types of diagnoses, and thus it is unreasonable to predict the status of patients with drastically different diagnoses using a single model. To concretize our prediction task, and to make the ML models’ interpretable information more reasonable, we extract admission records that contain the same diagnosis, specifically those of patients diagnosed with Atrial Fibrillation (AF). Then, we process the database into a tabular dataset containing the extracted relevant features. As a result, we obtain 8 features from 12,886 AF patients. These features include demographic information, admission status, and information about the inpatient stay (e.g., the number of ICU stays, icustays_num).
Although we use a specific dataset for the development of our visualization interface, our analysis methods and visualizations are designed to be applicable to other clinical datasets.
3.2 Machine Learning Methods
For our analysis, we use six different tree-based methods: decision tree (DT) [37], random forest (RF) [47], gradient boosting decision tree (GBDT) [14], light gradient boosting machine (LightGB) [22], CatBoost [12], and XGBoost [5]. These six methods are chosen because they are widely used for clinical predictions and often provide satisfactory performance.
As described in Sec. 3.3, we use model-agnostic interpretation methods to understand the models’ rationales. Therefore, although we use these six methods throughout the rest of the paper, our methodology in the ensuing subsections is generic enough to apply to different ML methods, including deep learning models.
3.3 Analysis Methods for Understanding Models’ Internal Criteria for Their Predictions
We measure feature contributions to compare multiple ML models’ internal criteria for their predictions (AQ1). Given an ML model and the classes of the prediction target, a feature contribution represents how strongly each feature affects the prediction results. Typically, for a binary prediction task, a feature contribution can be a positive value (contributing to the positive class), zero (neutral), or a negative value (contributing to the negative class). In terms of granularity, there are global and local feature contributions. A global feature contribution represents the general effect of a feature on the overall prediction across all records, whereas a local feature contribution shows the impact of a feature’s value in an individual record on the corresponding prediction.
To answer the first part of AQ1 (having an overview of multiple models’ inner criteria), adopting either global or local feature contributions should be sufficient. However, for the second part of AQ1 and for AQ2, we should offer comparisons at a more detailed level. Therefore, instead of obtaining an overview of each feature’s impact on the predictions (i.e., global contributions), we have decided to measure local feature contributions.
To obtain local feature contributions, for DT and RF, we use the method described in [34], and for the other methods, we adopt the SHAP value [31]. Of the two methods, the SHAP value is model-agnostic and, thus, can be adapted to measure the feature contributions of any other model.
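To make the idea behind Shapley-value-based contributions concrete, the following Python sketch computes exact Shapley values by brute force for a toy scoring function. It is not the SHAP library and not the estimator used in this study: the function `toy_model`, the baseline record, and all numbers are assumptions chosen for illustration, and features that are "absent" are simply set to their baseline value.

```python
from itertools import permutations


def toy_model(x):
    """Hypothetical risk score with an interaction between two features."""
    age, los, icu = x
    return 0.01 * age + 0.02 * los + 0.05 * icu * (los > 5)


def shapley_values(model, record, baseline):
    """Exact Shapley values by averaging over all feature orderings.

    Features not yet revealed in an ordering keep their baseline value.
    The cost grows factorially, so this only works for a few features.
    """
    n = len(record)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for j in order:
            current[j] = record[j]          # reveal feature j
            value = model(current)
            totals[j] += value - previous   # marginal effect of feature j
            previous = value
    return [t / len(orderings) for t in totals]


record = (70, 10, 2)     # made-up patient: age, length of stay, ICU stays
baseline = (60, 4, 1)    # made-up reference patient
phi = shapley_values(toy_model, record, baseline)
```

The resulting contributions sum to the difference between the model output for the record and for the baseline, so each value in `phi` can be read as the signed share of that difference attributed to one feature.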
Though we use tree-based ML models for this study’s experiments, we still employ model-agnostic interpretation methods for two reasons. First, while the theoretical background of the tree-based models is similar, each of them still employs a different technique and provides a different interpretation method. Using a model-agnostic method provides a fair comparison across the models. Second, employing such interpretation methods makes our methodology more generalizable to other potential ML models.
Let $\mathbf{x}_i$ be a vector of feature values of the $i$-th data record ($i = 1, \dots, n$). $\mathbf{x}_i$ can be represented as $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,d})$, where $d$ is the number of features and $x_{i,j}$ is the $j$-th feature value of the $i$-th data record. We obtain the local feature contributions of all features for each data record. Specifically, $\mathbf{c}_i = (c_{i,1}, \dots, c_{i,d})$, where $\mathbf{c}_i$ is the vector of local feature contributions of $\mathbf{x}_i$ and $c_{i,j}$ is the $j$-th feature’s contribution for the $i$-th data record. Because there are $m$ models, for each model, we compute a set of such local feature contributions with $d$ features for each of the $n$ data records. As a result, in total, we have $n \times m$ feature contribution vectors of length $d$. For example, our dataset described in Sec. 3.1 has $n = 12{,}886$ and $d = 8$. Also, we compare the $m = 6$ models described in Sec. 3.2. Thus, in our case, we get $12{,}886 \times 6 = 77{,}316$ vectors of length $8$.
However, it is difficult to review a large set of feature contributions (e.g., 77,316 vectors) one by one. Thus, to effectively obtain an answer for AQ1, we provide an intuitive overview of the feature contributions’ similarities across the different models. A visualized example can be found in Figure 5. We adopt dimensionality reduction (DR) methods, such as t-SNE [48], to project these vectors of dimension $d$ onto a 2D plot. With DR methods, data points with similar feature contributions are placed close to each other. In addition, the data points can be color-coded by their corresponding prediction models. Therefore, by reviewing the distribution of the clusters that visually appear in the DR result in conjunction with the color information of the prediction models, users can explore how the local feature contributions vary among multiple models. Refer to Sec. 5 for a detailed analysis example.
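The bookkeeping described above can be sketched in a few lines of Python. The sketch below only shows how per-model contribution matrices might be stacked and labeled before being handed to a DR method; the contribution values are random stand-ins, and the names `fake_contributions`, `stacked`, and `labels` are assumptions for this example rather than part of our system.

```python
import random

random.seed(0)

MODELS = ["DT", "RF", "GBDT", "LightGB", "CatBoost", "XGBoost"]
N_RECORDS = 12_886   # AF admission records (Sec. 3.1)
N_FEATURES = 8       # extracted features (Sec. 3.1)


def fake_contributions(n_records, n_features):
    """Stand-in for one model's local feature contributions (random numbers)."""
    return [[random.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(n_records)]


# One contribution matrix per model, stacked row-wise with a model label
# so that each projected point can later be colored by its model.
stacked, labels = [], []
for name in MODELS:
    rows = fake_contributions(N_RECORDS, N_FEATURES)
    stacked.extend(rows)
    labels.extend([name] * len(rows))

total_vectors = len(stacked)   # number of records times number of models
```

With six models and 12,886 records, `total_vectors` equals 77,316, each vector having eight entries; `stacked` is the input that would be projected to 2D, and `labels` supplies the color of each projected point.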
3.4 Consistency Measures of Models’ Decision Criteria with Interpreted Information
After we compare each model’s internal criteria for its predictions by using the methods described in Sec. 3.3, we want to analyze the consistency of each model (AQ2). As described in AQ2, we consider a model to have high consistency when its criteria for the prediction are robust to small perturbations in an input feature value. Since local feature contributions characterize how each feature value contributes to the prediction, we concretize the definition of consistency as follows.
 Consistency of the model’s internal criteria:

The model’s internal criteria have a lower consistency when local feature contributions are more independent from input feature values (the feature contributions are decided randomly regardless of feature values). On the other hand, the criteria have a higher consistency when local feature contributions have a higher dependency on input feature values (the feature contributions are more decisive based on feature values).
For example, the two scatterplots in Figure 1 visualize the local feature contributions (y-direction) against the feature values (x-direction) for two different ML models. Here we have the same number of samples for each model (12,886 samples each). We consider that Model A (Figure 1a) has lower consistency than Model B (Figure 1b). For example, for input values in the range from 0 to 25, Model A has random feature contributions. Thus, for such input values, Model A keeps changing how much it relies on the corresponding feature. This shows Model A’s low consistency in its criteria for the prediction.
With the definition above, we can obtain consistency with “measures of dependence” [40] between the input feature values and local feature contributions. Measures of dependence capture how strongly two variables depend on each other. For example, Pearson’s correlation coefficient is one of the most popular measures of dependence. As the example in Figure 1b shows, feature values and local feature contributions often form a nonlinear dependency. Therefore, we decide to use measures that can capture both linear and nonlinear dependencies. Moreover, it is ideal to use measures that make no assumptions about the variables’ distributions. Recently developed measures such as mutual information (MI) [9], maximal information coefficient (MIC) [38], and total information coefficient (TIC) [40] fulfill these requirements. Among these measures, TIC has been reported to perform best across various datasets [39, 42]. We also tested these three measures on our dataset, and TIC produced the most reasonable conclusions. Therefore, we have decided to use TIC for measuring consistency. Comparisons of these measures are discussed in Sec. 6.
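The following Python sketch illustrates how a measure of dependence separates a consistent model from an inconsistent one. It uses a simple binned, plug-in estimate of mutual information as a stand-in and does not compute TIC; the synthetic data, the bin count, and the function name `binned_mutual_information` are assumptions made for this example.

```python
import math
import random
from collections import Counter


def binned_mutual_information(xs, ys, bins=10):
    """Plug-in mutual information (in nats) after equal-width binning."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0   # guard against constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum of p(i, j) * log(p(i, j) / (p(i) * p(j))) over occupied cells.
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


random.seed(0)
values = [random.uniform(0, 90) for _ in range(5000)]     # feature values
# Consistent case: contributions follow a nonlinear function of the value.
consistent = [math.sin(v / 15) + random.gauss(0, 0.05) for v in values]
# Inconsistent case: contributions are unrelated to the value.
inconsistent = [random.uniform(-1, 1) for _ in values]

mi_consistent = binned_mutual_information(values, consistent)
mi_inconsistent = binned_mutual_information(values, inconsistent)
```

Because the consistent contributions are a non-monotonic function of the feature value, a linear correlation coefficient would understate the dependence, whereas the information-based estimate is clearly larger for the consistent case than for the random one.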
4 Visual Analytics System
We describe our visual analytics system using the methodology described in Sec. 3. As shown in Figure 2, the system consists of four main views. The first two views, Figure 2a and b, are developed for an overall comparison between different models’ internal criteria of their predictions (AQ1), whereas the last two views, Figure 2c and d, can be used for a detailed comparison of the models’ consistency (AQ2). We provide a demonstration of the user interface as a supplementary video (https://www.youtube.com/watch?v=KBZYcwEo43Q).
4.1 Overall Comparison of Models’ Interpretations (Figure 2a and b)
Using the method described in Sec. 3.3, an overview of the similarities of each model’s local feature contributions, named the Admission Overview, is visualized in Figure 2a, where each point represents an admission record together with the ML model used for the prediction. We employ t-SNE [48] as the dimensionality reduction (DR) method because it is suitable for finding patterns (e.g., clusters) in a large dataset (77,316 data points in Figure 2a). Specifically, we use the openTSNE [36] implementation for its fast computation and precise control of t-SNE’s parameters. We color each point based on the model it belongs to. We use categorical colors with sufficiently different hues to distinguish the models from each other. Also, we set color transparency so that overlapped points remain visible. By viewing the positions of the points of the same model and the distances among points of different models, the user can assess the diversity of different models’ rationales. For example, if two models have only a small number of overlaps, they have different prediction mechanisms (e.g., green and cyan points in Figure 2a).
The Admission Overview provides lasso selection with mouse dragging. The lasso selection allows the user to select a cluster of points. Also, as shown in Figure 2a, the user can select multiple clusters. The selected clusters are indicated by the drawn lasso shapes with identifying numbers (e.g., ① and ② in Figure 2a). Based on the selection, the user can review in detail how the local feature contributions of the selected cluster(s) differ from those of the other points in the Feature Contributions View (Figure 2b).
The Feature Contributions View in Figure 2b shows a table whose cells contain histograms for comparing the distributions of local feature contributions. As shown in Figure 2b and Figure 3, each row corresponds to a feature and each column to one of the selected clusters. Each cell then shows the distributions of the local feature contributions of the corresponding feature for the selected cluster and for the others. As indicated in the legend of Figure 3, the pink and gray histograms correspond to the selected cluster and the others, respectively, where the y-coordinates represent the relative frequency. We have decided to use these two colors to differentiate them from the categorical colors in the Admission Overview, which are used to represent the models. Note that since the selected points could come from multiple different models, we cannot use a model’s color in place of pink. By comparing the heights of the pink and gray bars, we can understand how the selected cluster’s prediction criteria differ from the others. For example, in Figure 3, cluster 2 tends to have higher feature contributions for feature 1 than the other points. Thus, we can say that cluster 2 relies heavily on feature 1 for its predictions. Also, by comparing the histograms within each row, the user can observe which feature’s distribution varies across the selected clusters.
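The comparison shown in each cell can be sketched as two relative-frequency histograms computed over shared bins. The Python sketch below is only an illustration of that computation; the contribution values, the selected indices, the bin edges, and the function name `relative_frequencies` are made up for this example.

```python
def relative_frequencies(values, edges):
    """Share of values in each bin [edges[k], edges[k+1]); last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            last = k == len(counts) - 1
            if edges[k] <= v < edges[k + 1] or (last and v == edges[-1]):
                counts[k] += 1
                break
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Made-up contributions of one feature for 8 points; the lasso picked 3 of them.
contributions = [0.05, 0.10, 0.45, 0.55, 0.60, 0.15, 0.20, 0.90]
selected = {2, 3, 4}

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
in_cluster = [c for i, c in enumerate(contributions) if i in selected]
others = [c for i, c in enumerate(contributions) if i not in selected]

pink = relative_frequencies(in_cluster, edges)   # selected cluster
gray = relative_frequencies(others, edges)       # all remaining points
```

Using relative rather than absolute frequencies keeps the two histograms comparable even when the selected cluster is much smaller than the remaining set of points.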
Through the analysis using the Admission Overview and the Feature Contributions View, users can understand which features have high contributions to the predictions of multiple selected sets of points. Together with the information about the selected points’ models, users can move on to the analysis of consistency using Figure 2c and d.
4.2 Comparison of Models’ Consistencies (Figure 2c and d)
After understanding the general differences or similarities among multiple models, we move on to the comparison of different models’ consistencies in their inner rationale of decision making (AQ2).
We first visualize the dependencies between each feature’s contributions and its values in the Model Summary View (Figure 2c). In the scatterplot of this view, each point’s y-coordinate represents the performance measure (accuracy rate (ACC) or area under the curve (AUC)) of the corresponding model. The user can select one of these performance measures. Each point’s x-coordinate represents the measure of dependency, specifically TIC in our case. While ACC or AUC is a measure for each model, TIC is a measure for each feature of each model. Thus, we use a horizontal line to represent each model and, within each horizontal line, each circled dot conveys the TIC of one feature. Additionally, we use a rectangle to indicate the average TIC over all features, which shows the overall consistency of the model. For the cluster information, we use the same categorical colors as in the Admission Overview (Figure 2a) to link the two views. Since the measures of prediction performance and consistency are encoded in the y- and x-coordinates, respectively, it is intuitive to observe that the models whose rectangles are closer to the upper right corner of the plot produce more accurate, trusted results. Across the different features of each model, predictions that rely heavily on features (circled dots) with high TIC can be more trusted when the corresponding model’s rationales are viewed.
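As a rough sketch of the data behind this view, the Python snippet below builds one point per feature plus one average point per model, each pairing the model’s performance with a consistency value. The model names, AUC values, TIC values, and the function name `summary_points` are invented for illustration and are not results from our experiments.

```python
# Made-up numbers for illustration only: AUC per model, TIC per feature.
summary = {
    "model_a": {"auc": 0.80, "tic": {"age": 0.60, "los": 0.20, "adm_loc": 0.40}},
    "model_b": {"auc": 0.78, "tic": {"age": 0.70, "los": 0.80, "adm_loc": 0.60}},
}


def summary_points(summary):
    """Build (model, kind, label, performance, consistency) tuples to plot."""
    points = []
    for model, info in summary.items():
        # One circled dot per feature, sharing the model's performance value.
        for feature, tic in info["tic"].items():
            points.append((model, "feature", feature, info["auc"], tic))
        # One rectangle per model at the average consistency of its features.
        mean_tic = sum(info["tic"].values()) / len(info["tic"])
        points.append((model, "mean", "all features", info["auc"], mean_tic))
    return points


points = summary_points(summary)
means = {p[0]: p[4] for p in points if p[1] == "mean"}
```

In this toy input, `model_a` has slightly better performance but a lower average consistency than `model_b`, which is the kind of trade-off the view is meant to expose.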
By hovering over each point, the user can see its detailed information (feature name, x-value, and y-value). Also, all circled dots corresponding to the hovered feature are highlighted with gray outer rings in the horizontal lines of the other models for clearer comparison. Note that black outer rings mark the selected circled dots, as explained below. A hovered example can be seen in Figure 7. Through this summary view, the user can get an overview of how the different features of different models are distributed in terms of their consistency and prediction performance.
After obtaining the summary of the consistency of each model and each feature within a model, the Consistency Charts (Figure 2d) can be used to verify the results in the Model Summary View as well as to compare the consistency of different ranges of feature values. To choose the feature and models to analyze in detail, the user can select one or multiple circled dots that correspond to a certain feature (i.e., points with the same feature name but different colors).
For the selected point(s), the Consistency Charts (Figure 2d) provide a visual explanation of the consistency calculated in the Model Summary View as well as a more detailed inspection at the level of feature values. Because the features we extracted from the MIMIC-III dataset are either continuous or categorical, we provide a different visualization for each type of feature. Users can switch between different features by clicking on the points in the Model Summary View. For continuous features, in the top view of Figure 2d, we provide scatterplots of feature contributions (y-coordinates) against feature values (x-coordinates), similar to the plots in Figure 1. To indicate the selected models, we use the same categorical colors as in the Admission Overview and the Model Summary View. For each model’s scatterplot, we also provide a regression line that shows how the feature contributions change in relation to the values. Since the relationship between feature contributions and feature values is often nonlinear, we adopt LOESS regression [7], which is a widely used nonlinear regression method. Then, in the bottom view of Figure 2d, we also plot the residuals of the regression using the y-coordinates. The x-coordinates represent feature values, as in the top view. This plot helps the user quantitatively assess how well the feature contributions follow the distribution of feature values. Comparisons between models can be done by overlaying the scatterplot and the residual plot of each model. For categorical features, we provide a plot of points representing the mean values, with error bars, of the local feature contributions for each feature value, as illustrated in Figure 4. Comparisons between models can be done by comparing the ranges of the error bars for each feature value. To make comparisons easier, we plot different models’ points and error bars with a small gap between their x-coordinates instead of simply overlaying them at the same x-coordinates.
By reviewing how widely points are distributed along the y-direction for each x-coordinate, the user can understand which model and/or which range of feature values has higher consistency. For example, in Figure 2d, we can see that the yellow model generally has higher consistency than the blue model. Furthermore, within the yellow model, when the feature values are smaller, the residuals tend to have higher absolute values. Thus, when the yellow model predicts the result for patients who have low values for this feature, the model’s prediction criteria have low consistency.
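The residual analysis for continuous features can be illustrated with a simplified LOESS-style smoother. The Python sketch below fits a tricube-weighted local linear regression in a single pass, without the robustness iterations of full LOESS, and is not the implementation used in our system; the synthetic data, the `frac` parameter, and the function name `local_linear_fit` are assumptions for this example.

```python
import random


def local_linear_fit(xs, ys, frac=0.3):
    """Fitted values from a tricube-weighted local linear regression."""
    n = len(xs)
    k = min(n, max(2, int(frac * n)))     # neighbours used for each fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        bandwidth = dists[k - 1] or 1e-12
        sw = swx = swy = swxx = swxy = 0.0
        for x, y in zip(xs, ys):
            u = abs(x - x0) / bandwidth
            if u >= 1:
                continue
            w = (1 - u ** 3) ** 3         # tricube kernel weight
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:            # degenerate window: weighted mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


random.seed(0)
xs = [random.uniform(0, 30) for _ in range(300)]          # feature values
# Noise is larger for small feature values, mimicking lower consistency there.
ys = [0.02 * x ** 1.5 + random.gauss(0, 0.6 if x < 10 else 0.1) for x in xs]

residuals = [y - f for y, f in zip(ys, local_linear_fit(xs, ys))]
low = mean_abs([r for x, r in zip(xs, residuals) if x < 10])
high = mean_abs([r for x, r in zip(xs, residuals) if x >= 10])
```

Comparing `low` and `high` mirrors what the residual plot conveys visually: a range of feature values whose residuals are larger in absolute value is a range in which the contributions depend less reliably on the feature value.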
5 Case Study
Using the preprocessed MIMIC-III dataset (refer to Sec. 3.1), we compare the six ML models described in Sec. 3.2 in terms of their prediction performances, internal rationales for predictions (AQ1), and their consistencies (AQ2).
5.1 Models’ Prediction Performances
As described in Sec. 3.2, we trained six models with DT, RF, GBDT, LightGB, CatBoost, and XGBoost. We then obtained the area under the curve (AUC) and accuracy rate (ACC) for each model, as shown in Table 1. From Table 1, in terms of prediction performance, we can say that XGBoost has the best performance, while the other methods, excluding DT, have performance similar to XGBoost.
5.2 Overall Comparison of Models’ Internal Prediction Criteria (AQ1)
After normalizing different models’ local feature contributions for each patient, we performed t-SNE on these feature contributions. The t-SNE plot was then color-encoded and shown in the Admission Overview, as demonstrated in Figure 5.
From the overview of similarities of local feature contributions, we can observe a divergence in the positions of the points representing the feature contributions, whereas points that belong to the same model tend to be clustered together. However, there are exceptions. For example, the points of models CatGB (yellow) and Gradient Boosting (red) have many overlaps and no distinguishable boundaries, although some areas are denser with points of one model than the other. This implies that these two models share more similar prediction criteria than the others. Nevertheless, we can find two general tendencies from Figure 5. First, most of these six models’ inner criteria tend to differ from each other, as we can see distinct clusters for each model. Second, even within the same model, the prediction criteria tend to differ depending on patients’ features. For example, for LightGB (green), while there are several clusters around the top left of Figure 5, we can see a distinct cluster around the bottom center of Figure 5.
Then, we select two clusters of points within the overview to review the differences of their local feature contributions in detail. As seen in Figure 5, Cluster 1 contains mostly LightGB points, while Cluster 2 contains mostly DT and RF points. The Feature Contributions View visualizes the local feature contributions’ distribution of these two sets of points, as shown in Figure 6. By comparing each feature within the same row, we discover that there are significant differences for features ethnicity, age, diagnoses_num (number of diagnoses), los (length of stay), and adm_loc (admission location), as highlighted in Figure 6. In the following subsection, we use feature los as an example to demonstrate the use cases of the remaining views.
5.3 Comparison of Models’ Consistencies (AQ2)
We then move on to the Model Summary view to analyze the differences in consistency of the feature los across models. Here, we focus on the three models of which local feature contributions are included in the selected two clusters (i.e., LightGB in Cluster 1; DT and RF in Cluster 2). As shown in Figure 7, the highlighted points correspond to feature los of the three models. As observed in Figure 7, LightGB’s los feature has a much higher TIC than DT’s and RF’s. Therefore, it can be inferred that clinicians can rely more on los when they adopt LightGB model while they cannot rely as much on los when using DT or RF.
By clicking the corresponding points for feature los of LightGB and DT as an example, we analyze the detailed differences of their consistencies with the Consistency Charts, as shown in Figure 8. Through the comparisons between the overlaying scatterplots, we can see that it is intuitive and convincing that LightGB has more consistency than DT because the points tend to be more closely distributed around the regression line. Furthermore, when looking at the residual plots, as shown in Figure 8, we observe that DT’s residuals tend to have a range wider for small values of los, whereas LightGB’s tend to have small residuals for any values of los. Through this view, we can tell that LightGB’s trends of los’s feature contributions can be more trusted.
Through this case study, we demonstrate how effectively we can answer AQ1 and AQ2 with our visual analytics system. Here we have shown only one flow of analysis. However, we can try different sets of selections and gain more comprehensive insights. For example, the user may also want to select and analyze different points in Figure 5. Such analysis and exploration can be performed with the flexible interactions supported in our system.
6 Discussion
We discuss our algorithm choice and visualization design. Then, we describe the limitations of our methods and future work.
6.1 Dependency Measures
During the selection process of algorithms, we tried several measures of dependency between two random variables to characterize consistency. Since we use the Consistency Charts to verify the calculated dependency values, we also performed comparisons of the dependency measures with this view. As stated in Sec. 3.4, MI, MIC, and TIC could be used to measure the dependency between the feature contributions and feature values. Thus, we tried all these measures and viewed the relationships between the computed dependencies and visual results in the Consistency Charts.
Through this comparison, we observed that although the three measures provided reasonable results for most features, MI and MIC behave erratically when evaluating dependencies, especially when there is a large range of feature contributions for the same feature value (i.e., many different y-coordinates for the same x-coordinate in the Consistency Charts). On the other hand, TIC is more stable across all types of features and produces more reasonable results. For example, as shown in Figure 9, both MI (a) and MIC (b) indicate that the feature los of model DT has a high dependency value, even higher than that of CatGB. However, as shown in Figure 10, by reviewing the detailed consistency information, we noticed that CatGB (yellow) should have a higher dependency value than DT (blue). This is because CatGB’s range of feature contributions over similar feature values is smaller than DT’s, and its residuals are closer to 0. Adopting TIC as the dependency measure resolves the unexpected behavior of the other two measures, as we have already shown in Figure 7. Thus, we have chosen TIC as our consistency measure. This example from our experiment shows that an analysis algorithm and its related visualization can be evaluated by coupling and comparing them and, as a result, we can select a better algorithm and/or visualization.
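For readers unfamiliar with these measures, the sketch below estimates mutual information on a single fixed equal-width grid. This is an assumption made for illustration: it is plain binned MI, not MIC or TIC, which search over many grid resolutions and normalize the result. The function name, bin count, and data are hypothetical, and the example does not reproduce the behavior shown in Figure 9.

```python
from collections import Counter
from math import log2


def binned_mi(xs, ys, bins=4):
    """Mutual information (in bits) estimated on one equal-width bins-by-bins grid."""

    def to_bin(value, lo, hi):
        if hi == lo:
            return 0
        # Clamp the maximum value into the last bin.
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = [(to_bin(x, x_lo, x_hi), to_bin(y, y_lo, y_hi)) for x, y in zip(xs, ys)]

    n = len(cells)
    joint = Counter(cells)
    count_x = Counter(i for i, _ in cells)
    count_y = Counter(j for _, j in cells)

    # Sum p(i, j) * log2(p(i, j) / (p(i) * p(j))) over the occupied cells.
    return sum(
        (c / n) * log2(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in joint.items()
    )


# Hypothetical feature values and contributions.
feature_values = [0, 0, 1, 1, 2, 2, 3, 3]
tight = [0, 0, 1, 1, 2, 2, 3, 3]   # one contribution level per feature value
spread = [0, 3, 1, 2, 3, 0, 2, 1]  # two distant levels per feature value

print(binned_mi(feature_values, tight))   # 2.0 bits
print(binned_mi(feature_values, spread))  # 1.0 bit
```

With this fixed grid, spreading the contributions over two levels per feature value halves the estimate. How strongly an estimate reacts to such spread depends on the grid, which is one reason measures built on different grid searches can disagree.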
6.2 Visualizations for Categorical Features
Similar to continuous features, for categorical features, we first designed the Consistency Charts using both scatterplots of distributions (e.g., the top view of Figure 10) and an assisting chart that shows how much local feature contributions differ at the same feature value (e.g., the bottom view of Figure 10). For categorical features, the counterpart of the regression line of continuous features is the mean or median value for each category; the equivalent of residuals is the error bars calculated around the mean or median (e.g., bars showing standard deviations). However, the mean or median and error bars are usually plotted together. Therefore, instead of showing them in two different views, we first decided to follow this common practice (i.e., showing both in one view).
However, as shown in Figure 11, the results visualized in this format suffered from occlusion and clutter. This happened because the points (red and teal colored dots) and error bars shared the same coordinate. Because a categorical feature usually does not take many distinct values (e.g., about 70 in Figure 11), we have enough space to use slightly different coordinates around each corresponding categorical value. Thus, as shown in Figure 4, we tried a plot that places different models’ points and error bars with a small gap in coordinates. In this way, we were able to view and compare the means and error bars for different models clearly. Therefore, we decided to use this design for our visualization.
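The layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: we use the mean and population standard deviation as the per-category summary (the median or another error measure would work similarly), and the function names, the gap width, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def category_summary(categories, contributions):
    """Mean and standard deviation of the contributions within each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, contributions):
        groups[category].append(value)
    return {cat: (mean(vals), pstdev(vals)) for cat, vals in groups.items()}


def dodged_positions(category_index, n_models, gap=0.1):
    """One plotting coordinate per model, spaced by `gap` and centered on the category."""
    start = category_index - gap * (n_models - 1) / 2
    return [start + gap * m for m in range(n_models)]


# Hypothetical contributions of one categorical feature for a single model.
summary = category_summary(["a", "a", "b"], [1.0, 3.0, 5.0])
print(summary)

# Three models drawn side by side around the category at index 2.
print(dodged_positions(2, 3))
```

Each model’s mean marker and error bar is then drawn at its own coordinate, so that the markers of different models no longer overlap at the category position.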
6.3 Limitations
Scalability for the number of features and ML models. Our visualizations provide enough scalability for the number of data records (e.g., patients). For example, the Admission Overview employs t-SNE [48] for dimensionality reduction and can visualize the overview of similarity of local feature contributions even for tens of thousands of points. However, for the number of features and the number of ML models, our visualizations have limited scalability. The number of ML models we can support is limited because we use colors to indicate the corresponding ML model. Thus, our visualizations can deal with fewer than about 10 models. We can address this problem by aggregating multiple models based on their similarities in a certain aspect. For example, as shown in Sec. 5.2, CatGB and Gradient Boosting have similar distributions of local feature contributions and, thus, the user might want to analyze them as one aggregated model.
For the number of features, we would need to improve several designs if the data were to contain many features. For instance, the Feature Contributions View in Figure 2b shows all features’ information by aligning them in rows. This layout is reasonable for our dataset consisting of 8 features. However, when there are more than 10 features, showing and analyzing all features’ information is not realistic. In this case, the system should automatically suggest which features the user should review to understand the differences between the selected points from the Admission Overview (Figure 2a) and others. For example, we can support such functionality by using the method introduced in [15].
Variety of analyzed ML methods. In the case study of our clinical data predictions, we adopted the tree-based ML methods. In terms of generalization, one future direction is to develop heuristic comparison methods for any kind of ML methods, including deep learning methods. For example, to extend our analysis on time series prediction methods, besides the reliability of each feature, we would also like to compare consistency of different time steps. For deep learning methods, the reliability of the last few layers’ outputs, for instance, should also be our focus of comparisons.
Supported analyses. Our methods and visualizations can help understand many aspects of the ML models’ prediction rationales. However, there are still some points that cannot be explained by the current system. For example, while we can analyze how each model relies on some features more than on others as a result of its learning, we cannot know how each model obtained such criteria. More specifically, as stated in Sec. 5.2, through the analysis, we found that CatGB and Gradient Boosting tend to have similar local feature contributions. However, we cannot further analyze why these two methods reached such results. Therefore, our work needs to be extended to cover such cases. For such analyses, we can incorporate the methods developed for understanding the learning process of ML models. For example, for the tree-based ML methods, we can use the method by Liu et al. [29].
7 Conclusions
We have developed a visual analytics system that utilizes quantitative methods to observe and compare multiple models’ reliability through their interpretable information. Using our system, insights into multiple models’ internal criteria can be obtained and their reliability can be further evaluated at both the overview and individual feature levels. Through the case study, we have demonstrated the usefulness of this visual analytics system in aiding clinical researchers’ model selection. Our visual analytics system can be extended to support various machine learning methods and to offer a more scalable interface that provides functionalities for a more comprehensive analysis.
Acknowledgments
This research is sponsored in part by the U.S. National Science Foundation through grant IIS-1741536 and a 2019 Seed Fund Award from CITRIS and the Banatao Institute at the University of California.
References

 [1] A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
 [2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
 [3] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730, 2015.
 [4] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports (Nature Publisher Group), 8:1–12, 04 2018.
 [5] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
 [6] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Proc. Advances in Neural Information Processing Systems, pp. 3504–3512, 2016.
 [7] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.
 [8] D. Collaris, L. M. Vink, and J. J. van Wijk. Instance-level explanations for fraud detection: A case study. CoRR, abs/1806.07129, 2018.
 [9] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.

 [10] L. Devroye. Exponential inequalities in nonparametric estimation. In Nonparametric functional estimation and related topics, pp. 31–44. Springer, 1991.
 [11] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, Sep. 1979.
 [12] A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient boosting with categorical features support. CoRR, abs/1810.11363, 2018.
 [13] M. Fouad, M. Mohamoud, M. Hagag, and A. Akl. Prediction of long term living donor kidney graft outcome: Comparison between rule based, decision tree and linear regression. International Journal of Advanced Research in Computer Science, 3:185, 04 2015.
 [14] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 [15] T. Fujiwara, O.-H. Kwon, and K.-L. Ma. Supporting analysis of dimensionality reduction results with contrastive learning. IEEE Transactions on Visualization and Computer Graphics, 26(1):45–55, 2020.
 [16] A. S. Goldfarb-Rumyantzev, J. D. Scandling, L. Pappas, R. J. Smout, and S. Horn. Prediction of 3-yr cadaveric graft survival based on pretransplant variables in a large national dataset. Clinical Transplantation, 17(6):485–497, 2003.
 [17] S. Guo, Z. Jin, D. Gotz, F. Du, H. Zha, and N. Cao. Visual progression analysis of event sequence data. IEEE Transactions on Visualization and Computer Graphics, 25(1):417–426, Jan 2019.
 [18] D. Haussler. Part 1: Overview of the probably approximately correct (PAC) learning framework. http://web.cs.iastate.edu/~honavar/pac.pdf, 1995. Accessed: 2019-12-19.
 [19] Z. Jin, S. Cui, S. Guo, D. Gotz, J. Sun, and N. Cao. CarePre: An intelligent clinical decision assistance system. ACM Transactions on Computing for Healthcare, 2019.
 [20] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Liwei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
 [21] E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, and M. Craven. Learning to predict post-hospitalization VTE risk from EHR data. In AMIA Annual Symposium Proceedings, vol. 2012, p. 436. American Medical Informatics Association.
 [22] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Proc. Advances in Neural Information Processing Systems, 2017.
 [23] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, Aug 1999.

 [24] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA, 1994.
 [25] J. L. Koyner, K. A. Carey, D. P. Edelson, and M. M. Churpek. The development of a machine learning inpatient acute kidney injury prediction model. Critical Care Medicine, 46(7):1070–1077, July 2018.
 [26] B. C. Kwon, M.-J. Choi, J. T. Kim, E. Choi, Y. B. Kim, S. Kwon, J. Sun, and J. Choo. RetainVis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Transactions on Visualization and Computer Graphics, 25(1):299–309, 2018.
 [27] Z. C. Lipton, D. C. Kale, C. Elkan, and R. C. Wetzel. Learning to diagnose with LSTM recurrent neural networks. In Proc. International Conference on Learning Representations, 2016.
 [28] S. Liu, X. Wang, M. Liu, and J. Zhu. Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics, 1(1):48 – 56, 2017.
 [29] S. Liu, J. Xiao, J. Liu, X. Wang, J. Wu, and J. Zhu. Visual diagnosis of tree boosting methods. IEEE Transactions on Visualization and Computer Graphics, 24(1):163–173, 2017.

 [30] G. Lugosi and M. Pawlak. On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Transactions on Information Theory, 40(2):475–481, March 1994.
 [31] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Proc. Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.
 [32] Y. Ming, H. Qu, and E. Bertini. Rulematrix: Visualizing and understanding classifiers with rules. IEEE Transactions on Visualization and Computer Graphics, 25(1):342–352, Jan 2019.
 [33] T. Mühlbacher, L. Linhardt, T. Möller, and H. Piringer. TreePOD: Sensitivity-aware selection of Pareto-optimal decision trees. IEEE Transactions on Visualization and Computer Graphics, 24(1):174–183, Jan 2018.
 [34] A. Palczewska, J. Palczewski, R. Marchese Robinson, and D. Neagu. Interpreting Random Forest Classification Models Using a Feature Contribution Method, pp. 193–218. Springer, Cham, 2014.
 [35] K. Pearson and F. Galton. VII. Note on regression and inheritance in the case of two parents. Proc. of the Royal Society of London, 58(347–352):240–242, 1895.
 [36] P. G. Policar, M. Strazar, and B. Zupan. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. BioRxiv, p. 731877, 2019.
 [37] J. R. Quinlan. Learning efficient classification procedures and their application to chess end games. In Machine Learning: An Artificial Intelligence Approach, pp. 463–482. Springer, 1983.
 [38] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
 [39] D. N. Reshef, Y. A. Reshef, P. C. Sabeti, and M. Mitzenmacher. An empirical study of the maximal and total information coefficients and leading measures of dependence. Ann. Appl. Stat., 12(1):123–155, 03 2018.
 [40] Y. A. Reshef, D. N. Reshef, H. K. Finucane, P. C. Sabeti, and M. Mitzenmacher. Measuring dependence powerfully and equitably. The Journal of Machine Learning Research, 17(1):7406–7468, 2016.
 [41] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
 [42] S. Romano, N. X. Vinh, K. Verspoor, and J. Bailey. The randomized information coefficient: assessing dependencies in noisy data. Machine Learning, 107(3):509–549, 2018.
 [43] A. Saabas. TreeInterpreter. https://github.com/andosa/treeinterpreter. Accessed: 2019-12-19.
 [44] T. Shaikhina, D. Lowe, S. Daga, D. Briggs, R. Higgins, and N. Khovanova. Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation. Biomedical Signal Processing and Control, 2017.
 [45] E. H. Shortliffe and M. J. Sepúlveda. Clinical Decision Support in the Era of Artificial Intelligence. JAMA, 320(21):2199–2200, 12 2018.
 [46] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 100(3/4):441–471, 1987.
 [47] Tin Kam Ho. Random decision forests. In Proc. International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282 vol.1, Aug 1995.
 [48] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 11 2008.
 [49] C. Wang, T. Onishi, K. Nemoto, and K.-L. Ma. Visual reasoning of feature attribution with deep recurrent neural networks. In Proc. IEEE International Conference on Big Data, pp. 1661–1668, 2018.
 [50] J. Wang, L. Gou, W. Zhang, H. Yang, and H. Shen. Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation. IEEE Transactions on Visualization and Computer Graphics, 25(6):2168–2180, June 2019.
 [51] J. Zhang, Y. Wang, P. Molino, L. Li, and D. S. Ebert. Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 25(1):364–373, 2018.
 [52] X. Zhao, Y. Wu, D. L. Lee, and W. Cui. iForest: Interpreting random forests via visual analytics. IEEE Transactions on Visualization and Computer Graphics, 25(1):407–416, Jan 2019.
 [53] T. Zheng, W. Xie, L. Xu, X. He, Y. Zhang, M. You, G. Yang, and Y. Chen. A machine learning-based framework to identify type 2 diabetes through electronic health records. International Journal of Medical Informatics, 97:120–127, 2017.