1 Background
Our approach is heavily based on existing work [15, 17] on partial dependence plots and other visualizations that use the partial dependence calculation to generate explanations. The implementation and usage of these plots are described below.
1.1 Partial Dependence Plots
Partial Dependence Plots (PDPs) [15] calculate the average prediction across all instances as the value of a single feature is changed, holding all other values constant. PDPs are typically constructed for each feature in a dataset; a sample of PDPs for the Bike dataset is presented in Figure 2. Intuitively, the partial dependence curve shows the best guess for a prediction if only one feature value is known. Equation 1 is used to generate a partial dependence curve for a single predictor feature:

$$\mathrm{PD}_f(v) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{pred}(x_i) \quad \text{with } x_{if} = v \tag{1}$$

N is the number of items in the dataset, pred is the function defined by the predictive model, f is the predictor feature in question, and v is a value in the domain of f. The model is treated as an oracle and generates N curves constructed of M data points each, where M is a hyperparameter that determines the granularity of the explanation. v takes on the values of the M quantiles of f.

Partial Dependence is one of the most common ways to communicate how a prediction depends on a single feature. The plots are easy to calculate and interpret, and have become fixtures in open-source [3, 43, 4] and proprietary [19] data-science toolkits. In addition to the traditional line chart, they have also been presented as colored bars [31] and as 2-feature heatmaps and contour plots (see Figure 1). Note that PDPs (and other plots in this family) can be presented with the standard scale (in which the Y-axis is read as the predicted value) or as a centered PDP (in which case the Y-axis is read as the change from the average prediction). Figures 2 and 1 present the standard scale; however, we use the centered PDP in the rest of this work.
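The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `partial_dependence`, the toy model `toy_pred`, and the synthetic data are all assumptions made for the example, and the model is treated purely as an oracle.

```python
import numpy as np

def partial_dependence(pred, X, f, M=20):
    """Average prediction as column f is set to each of (up to) M quantile values."""
    X = np.asarray(X, dtype=float)
    # Grid of values v: the M quantiles of feature f (duplicate values dropped).
    grid = np.unique(np.quantile(X[:, f], np.linspace(0.0, 1.0, M)))
    curve = np.empty(len(grid))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, f] = v                # overwrite feature f for every instance
        curve[k] = pred(X_mod).mean()  # average over all N instances
    return grid, curve

# Hypothetical stand-in for a fitted model: y = 2*x0 + x1.
def toy_pred(X):
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid, pd_curve = partial_dependence(toy_pred, X, f=0, M=10)

# Centered PDP: the Y-axis becomes the change from the average prediction.
centered = pd_curve - toy_pred(X).mean()
```

Because every grid value is substituted into every instance, the cost is M passes over the dataset per feature, which is why M controls the granularity of the curve.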
However, PDPs have three main shortcomings:

Unreasonable assumption of feature independence. The synthetic data generated by PDPs may be highly unlikely under the joint distribution if the input features are correlated. For example, in a dataset of personal health records, predictions would be generated for children up to the maximum height in the dataset, perhaps 6’ tall. The predicted target might be outlandish and skew the summary curve in regions with low probability mass.

Heterogeneous effects are obscured by the summary curve. The process of averaging the curves produced for each data point necessarily obscures varying shapes.

Feature interactions are difficult to separate from the main variable effect. The PDP curve includes all feature interactions, making it difficult to isolate the importance of the feature of interest itself.
1.2 Individual Conditional Expectation plots
To address PDPs’ tendency to obscure heterogeneous effects, [17] presented Individual Conditional Expectation (ICE) plots, which disaggregate the PDP line into its constituent curves, one for each data point in the original dataset. The ICE plot consists of a line plot with one series of predictions for each instance. While ICE plots display the full heterogeneity of effects, they inherit the other weaknesses of PDPs. Moreover, they scale poorly with the number of data cases, as they tend to overplot significantly, obscuring potentially interesting curves.
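As a rough sketch of the disaggregation described above, the following computes one curve per instance and shows that averaging them recovers the PDP. The names `ice_curves` and `toy_pred` and the synthetic data are assumptions for illustration only; the toy model is chosen so that the per-instance curves differ in shape while the average is nearly flat.

```python
import numpy as np

def ice_curves(pred, X, f, grid):
    """Return an (N, M) array in which row i is the ICE curve for instance i."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, f] = v            # set feature f to v for all instances
        curves[:, k] = pred(X_mod)  # one prediction per instance
    return curves

# Hypothetical model with an interaction: the effect of x0 flips sign with x1.
def toy_pred(X):
    return X[:, 0] * np.where(X[:, 1] > 0, 1.0, -1.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2.0, 2.0, 9)
curves = ice_curves(toy_pred, X, f=0, grid=grid)

# Averaging the ICE curves column-wise gives the partial dependence curve.
pdp = curves.mean(axis=0)
```

In this toy case half of the curves rise and half fall, so the averaged line hides exactly the heterogeneity that the individual curves expose.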
2 Related Work
In this survey, we review only a relevant subset of the large literature on interpretable machine learning. Specifically, we focus on methods for tabular datasets, and ignore the areas of interpretability for image and text data, in which the notion of a feature is very different. We also narrow our focus to model explanations that are primarily visual.
2.1 Global vs. Local Explanations
As previously discussed, one of the major axes in this space divides techniques into global explanations and local explanations. Global explanations present a summary of the model without regard to specific instances or subsets of interest to the user. Inherently interpretable models are one type of global explanation. These models include linear regressions, decision trees, and rule lists. Users can generally comprehend these models by simply reviewing their internals (e.g. the coefficients and bias for a linear regression). While these have been frequently lambasted for subpar performance, a new class of inherently interpretable models [34, 50, 9, 1] has emerged in recent years, promising predictive power comparable to black box models for some problems. Feature importance scores [6, 14, 8] and feature interaction scores [23, 18, 16] are global explanations as well. The latter are discussed in more detail below.

A Partial Dependence Plot [15] is a common form of global explanation. The entire model can be summarized with a series of single line charts (one for each feature in the model). The effect communicated is an aggregate behavior that may not represent the prediction process for any specific instance. ICE plots could perhaps be nominally categorized as a local method, since they bind one encoding (a curve) per data point. However, in practice, overplotting obscures many of the points, and no prior work has provided utilities for a user to inspect a single point’s ICE curve. Therefore, they are more accurately viewed as a global explanation that provides some additional information over PDPs.
Many visual analytics systems for model analysis and debugging (see the excellent survey in [22]) employ model summaries as one of the available views. While these systems tend to focus on the internal elements of neural networks or other specific model types, these overviews are another type of global explanation.
Local explanations focus on the prediction for a single data case. The major use cases for these approaches include consumer-oriented applications (“why was my loan application denied?”) and model debugging. Furthermore, these explanations mirror the techniques humans use to explain causality to each other [39].
Local explanations are sometimes communicated using a counterfactual; for example, “the prediction for this instance would move from negative to positive if feature X changed by Y%.” One method for developing these explanations is the Growing Spheres algorithm [32], which identifies the nearest dissimilar prediction in the data space and generates an explanation from the differences in the two points. Prospector [31] uses partial dependence curves to allow users to interactively generate synthetic data points that serve as counterfactuals.
Another class of local explanations uses prototypical examples of correct and incorrect classifications to explain a model [27]. This and other exemplar-based approaches do not provide an explanation per se, but rather operate on the premise that an explanation will be relatively clear to a subject matter expert once the examples are surfaced (e.g. they will note that all of the incorrect classifications had a particular unusual value for a certain feature).
The most well-known local approach is the LIME algorithm [44], which fits a model on a set of points randomly drawn from a Gaussian distribution centered on the dataset’s mean. Intuitively, this builds a model that captures the impact of slight movements around the data space centered on an instance of interest. The resulting model (typically linear) is sparse and interpretable.
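The idea of a local surrogate can be illustrated with a small sketch. This is not the LIME library or its exact procedure: the function `local_linear_surrogate`, the exponential proximity kernel, the `kernel_width` parameter, and the toy model are assumptions chosen for the example, and no sparsity constraint is applied.

```python
import numpy as np

def local_linear_surrogate(pred, X, x0, n_samples=2000, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around the instance x0."""
    rng = np.random.default_rng(seed)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    # Synthetic points drawn from a Gaussian centered on the dataset's mean.
    Z = rng.normal(mu, sigma, size=(n_samples, X.shape[1]))
    # Proximity weights: samples close to x0 (in standardized units) count more.
    d2 = (((Z - x0) / sigma) ** 2).sum(axis=1)
    w = np.exp(-d2 / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    A = np.hstack([np.ones((n_samples, 1)), Z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], pred(Z) * sw, rcond=None)
    return coef[0], coef[1:]

# Hypothetical black-box model used only for the demonstration.
def toy_pred(Z):
    return Z[:, 0] ** 2 + 3.0 * Z[:, 1]

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
intercept, slopes = local_linear_surrogate(toy_pred, X, x0=X[0])
```

The returned slopes describe the model only in the neighborhood that the kernel emphasizes, which is the limitation on generality discussed in the next section.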
2.2 Regional Explanations
There are clear downsides to both global and local approaches. Global approaches by definition sacrifice complexity and fidelity to the original model for simplicity. At the same time, local models tend to only be appropriate for specific use cases: a data scientist could not realistically debug a model by generating LIME explanations for 10 random instances out of a dataset of 100,000 records. In other words, existing local approaches provide no indication as to how they generalize beyond the instance in question.
VINE falls under the category of regional explanations, a novel category description under which we believe several pieces of prior work can be fruitfully categorized. Regional explanations split the difference between global and local approaches by describing behaviors that affect significant regions of the data space. This affords more generality than local approaches, but yields more specificity than global approaches. Regional explanations can be thought of as exceptions to global behavior: the global behavior (e.g. a PDP line) applies unless a data case falls in a specific cluster. We define regional explanations as explanations that meet at least one of the following criteria:

C1 An algorithm identifies a region of the data space in which many of the points share a common behavior in the model. A succinct description of this cluster is provided.

C2 The common behavior for this data space is described.
Below, we review related work that qualifies under this definition.
2.2.1 Subset-selection-based approaches
Many visual analytics systems provide utilities for users to select arbitrary subsets of interest, either by predicate or by direct manipulation. Users can then compare outcomes such as accuracy, or model internals such as nodes in a neural network. The GridViz application was developed by Google to help them understand a model for advertising click predictions by visually comparing slices of the data [38]. MLCubeExplorer displays a wide variety of distribution, prediction, and correlation data about subsets, with the intent of comparing the relative values of two models [26]. ActiVis [25] allows a user to select instances of interest from a visualization of model results, and compare them in a “neuron activation matrix” view that can surface common activation channels.
While these approaches meet criterion C2 above, they do not meet C1, as the subsets are not algorithmically generated. While interactive cohort construction is undoubtedly a useful tool, we argue that these approaches do not extract subsets which the model itself treats differently, and which may or may not correspond to human intuition.
2.2.2 Rule-based approaches
A wide variety of classifiers use a system of rules to make or explain predictions. Often, these rules take the form of a predicate (if a comparison of feature X against a threshold a holds, then predict positive). In this section we focus on rule-based methods that are specifically engineered for providing explanations; we do not consider a 10-layer decision tree interpretable by the average human.

One class of rule-based approaches consists of inherently interpretable models that use a series of rules to make a prediction [1, 33, 16]. RuleMatrix [40] uses a similar approach to generate rules that describe an underlying model, then presents the rules in an interactive visualization. Anchors [45] are model-agnostic explanations that use high-accuracy rules to define model behaviors. If a data case meets the criteria defined by a predicate, then it is highly likely that the anchor’s predicted value holds, regardless of the values of the other features of the data case. However, the rules extracted by this algorithm do not cover much of the data space.
While these rule-based approaches clearly define regions (C1), the behavior that they define for the region is very coarse, consisting of a single value (the prediction).
2.2.3 Clustering Approaches
Explanation Explorer [28], Rivelo [49] and related tools [30] generate local models for each datapoint, consisting of a minimal list of features that would strongly affect the prediction if changed. Datapoints are aggregated into clusters based on having identical or similar local models. Users can then view the details of instances in the cluster, as well as their evaluation metrics (e.g. the number of points predicted for each class, accuracy, etc.). [29] presents Class Signatures, which expand on these methods by clustering instances by both feature importance lists and prediction, thus creating more nuanced groups.
These tools deal exclusively with binary features and a binary target; the data type can be either tabular or text. This approach defines regions (C1) of the dataset, but due to the nature of binary features, there is less need to describe behavior (C2) for the cluster. The authors note that their approach is more fine-grained than feature importance scores. While this is true, tabular datasets with numerical and ordinal features require more complex expressions of behavior, for which partial dependence curves are well-suited.
Shapley Additive Explanations (SHAPs) [35] leverage a well-established game theory method to generate feature importances [48], and extend this technique to include variables representing feature interactions. These variables are then combined into an additive explanation for each point in the dataset. While this explanation is not sparse on its own, it allows instances to be clustered based on the ordering of feature importance values. The authors annotate their visualizations with hand-curated labels for clusters that are found to correspond to shared real-world explanations (e.g. these data points were predicted to have low income because they are young and single). However, it should be noted that while SHAPs automate the identification of regions (clusters), they do not algorithmically generate sparse explanations for these clusters. Moreover, VINE provides granular visualizations of the interaction behavior due to our use of ICE curves, whereas SHAPs are not primarily a visualization tool.
2.3 Feature Interactions
Regional explanations can also be understood as a form of statistical interaction effect between two features, when the effects of two features upon a prediction are nonadditive. Prior work has primarily focused on quantifying the strength of these nonadditive relationships via interaction scores. While the literature uses a variety of terms for these effects, such as statistical interactions, feature interaction effects, and nonadditive interactions, we use the term feature interactions throughout.
2.3.1 Model Types that Inherently Explain Interactions
Several types of GLMs (Generalized Linear Models) include features that model interaction effects. RuleFit [16] is a modified linear regression that includes interaction terms which are derived from the splits generated by tree ensembles. This method for generating interaction terms, and subsequent pruning with a regularization algorithm, ensures that RuleFit models are relatively sparse.
Another type of GLM is GAMs (Generalized Additive Models). A GAM is essentially a linear model in which each feature can be modified by a link function, which enables the model to capture nonlinear (say, logarithmic or quadratic) relationships between the feature and the target. A modified version, GA2Ms, adds interaction terms consisting of two features, which are again modified by a link function [34]. The GAMUT visual analytics system [21] uses GAM curves (as well as instance-based explanations) as a model explanation tool, in much the same way as partial dependence curves.
Both GA2Ms and RuleFit present clear explanations for individual features, but suffer from the difficulty of interpreting interaction terms. Plain GAMs are incapable of modeling feature interactions.
2.3.2 Score-based
One method for measuring interaction strength is the H-statistic [16], which compares the 2D partial dependence function for two features against the sum of the individual partial dependence functions for each feature. The loss is used to generate the interaction score, as it captures the degree to which additive explanations fail to recapture the target. Partial dependence functions have also been leveraged to calculate feature interactions [18]. This method observes the partial dependence function for feature A at various intervals of feature B, and calculates the variance in the PD function across all points. Intuitively, this method treats features A and B as independent if feature A’s importance to the model remains constant regardless of feature B’s value. While these methods generate numerical scores, the authors of their respective papers choose to communicate the scores with simple graphics, such as bar charts. Arguably, this is a natural mode of expression for this data.
These methods only identify the presence and strength (in terms of average impact on a prediction) of feature interactions. They do not indicate regions of a feature’s range in which interactions might be particularly strong or weak or the shape of the function that expresses the interaction.
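To make the comparison behind the H-statistic concrete, the sketch below evaluates centered partial dependence functions at the observed data points and measures how much of the 2D function is left unexplained by the two 1D functions. It returns the squared statistic. The helper names `pd_at_points` and `h_squared` and the two toy models are assumptions for illustration, and the brute-force loop costs on the order of N² predictions, so it is only suitable for small samples.

```python
import numpy as np

def pd_at_points(pred, X, cols):
    """Centered partial dependence on `cols`, evaluated at each instance's own values."""
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        X_mod = X.copy()
        X_mod[:, cols] = X[i, cols]  # fix the chosen features to instance i's values
        out[i] = pred(X_mod).mean()
    return out - out.mean()

def h_squared(pred, X, j, k):
    """Share of the 2D partial dependence not captured by the two 1D functions."""
    pd_j = pd_at_points(pred, X, [j])
    pd_k = pd_at_points(pred, X, [k])
    pd_jk = pd_at_points(pred, X, [j, k])
    denom = (pd_jk ** 2).sum()
    if denom == 0.0:
        return 0.0
    return float(((pd_jk - pd_j - pd_k) ** 2).sum() / denom)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
X -= X.mean(axis=0)  # center the toy data so the product model has no main effects

h_additive = h_squared(lambda Z: Z[:, 0] + Z[:, 1], X, 0, 1)  # purely additive model
h_product = h_squared(lambda Z: Z[:, 0] * Z[:, 1], X, 0, 1)   # pure interaction model
```

The two toy models sit at the extremes of the scale: the additive model yields a score near 0 and the product model a score near 1, but neither score says where in the feature's range the interaction occurs.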
2.3.3 Visual
In a Variable Interaction Network (VIN) [23], features are displayed in a stylized network graph in which connections indicate the presence of an interaction. This method is notable for its ability to efficiently identify interactions including 3 or more terms. The interactions are identified by an algorithm that uses a permutation method similar to feature importance scores [6] to identify features whose effect changes in the presence or absence of a potential interactor feature. The algorithm then cleverly prunes the search space by using the property that an interaction effect can only exist if all the lower-order effects that involve its features also exist. Similar to the H-statistic, Variable Interaction Networks do not communicate granular detail about the nature of interactions, only their presence.
ICE and PDP plots can be extended to communicate feature interactions, in ways which leverage their visual properties but do not generate interaction scores directly. [15] suggests a heatmap partial dependence plot, in which color is encoded as the average predicted value for all points in the 2D space defined by two features. This method visualizes feature interactions as color artifacts, such as sharp gradients or large areas with no variation (see for example [41]). Similarly, ICE plots can encode a second variable as the color of a line [17]. The simplest effect would be a correlation between hue and Y-value, which would indicate that two features have a positive super-additive interaction effect.
Partial Importance (PI) plots and Individual Conditional Importance (ICI) plots [8] operate much as PDP and ICE plots but visualize feature importance instead of prediction value. This is a regional approach in the sense that it visualizes the regions of a feature’s range in which it impacts predictions. The authors note that high variance between individual curves in an ICI plot suggests the presence of feature interactions.
ALE plots [2] are a solution to the aforementioned tendency of PDPs to generate inaccurate curves where features are highly correlated. ALE plots instead calculate partial dependence from small piecewise segments consisting of points with values in a narrow range, removing the need for synthetic data. These plots address the issue of feature interactions by allowing the user to view the feature’s main effect, and any interaction effects in separate plots.
2.4 Summary
The major downside to all of these approaches is that they require significant user time and skill, and there is no predefined threshold for a “significant” feature interaction. A data scientist would likely need to generate a scatterplot matrix of all possible feature combinations, or try one by one, perhaps with interaction scores or a VIN as a pruning mechanism. While these methods have enormous value in the process of exploratory data analysis, they are less suited for effectively communicating model properties.
Our approach to interaction effects is to present them when they serve as a relevant explanation for model behavior. We split the difference between relatively coarse interaction scores and complex charts. VINE is therefore not a pure feature interaction score and does not directly compete with measures such as the H-statistic. Rather, VINE curates a selection of feature interactions that aids the interpretation of model behavior by providing exceptions to global behavior. A user of VINE interprets a model using the global explanation (a PDP curve) except where a data case meets certain criteria (a region in which feature interaction effects occur). We argue that this is a parsimonious but powerful explanatory technique capable of communicating both feature interactions and nonlinear relationships while not overwhelming the user with many instance-level details.
3 Approach
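Our approach is to create a visualization for model explanation that leverages modified ICE plots and to present these plots in a visual analytic tool called VINE. We generate VINE curves by computing the ICE curves for a feature, clustering those curves (Section 3.1), generating an explanation for each cluster (Section 3.2), and merging clusters with similar explanations (Section 3.3).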
An example of this algorithm is presented in Figure 3. We believe that this process produces accurate regional explanations for model behavior in the form of partial dependence curves which apply to a subset of the dataset.
3.1 Calculating Clusters
To address the issue of overplotting on ICE curves, VINE tries to cluster similar curves and visualize a centroid curve instead. Note that this is a form of unsupervised clustering on the dataset, but that instead of using an instance’s feature vector as its representation, we instead use the X,Y tuples that constitute its ICE curve. We assessed a variety of clustering algorithms and distance metrics with the goal of generating accurate clusters quickly. Accuracy was initially assessed by visually comparing the centroids against the constituent ICE curves to validate that clusters were cleanly separated. In particular, we assessed the following clustering algorithms, using implementations from scikit-learn [43]: DBSCAN, KMeans, Affinity Propagation, Agglomerative Clustering, and Birch. We found that Agglomerative Clustering [52] and Birch [54] both performed acceptably, with Birch running approximately 2.5x faster, but producing less cleanly defined clusters. Agglomerative Clustering is used for all examples in the paper, although Birch can be selected as an option when running the script.

A more difficult question was the choice of distance metric for calculating the pairwise distance between ICE curves. Euclidean distance produced groups of curves which a human would clearly recognize as inappropriate. While Dynamic Time Warp produced clusters that appeared highly appropriate to the eye, we were unable to identify a fast implementation. This was necessary because all pairwise distances must be calculated, meaning that our algorithm scales in $O(KN^2)$ time, where K is the number of features and N is the number of items in the dataset. We also tried the Slope Similarity algorithm, which compares the Euclidean Distance between the slopes of ICE curves instead of their raw points. The Slope Similarity measure produced appealing results as well, and ran in the same time as Euclidean Distance, making this an ideal choice for our purposes.
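A minimal sketch of slope-based clustering follows. It represents each ICE curve by its segment slopes and clusters those vectors with scikit-learn's `AgglomerativeClustering`. The function name `cluster_ice_curves`, the default Ward linkage, and the synthetic curves are assumptions made for the example; the text above does not specify the linkage used.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_ice_curves(curves, grid, n_clusters=5):
    """Cluster ICE curves by the Euclidean distance between their slopes."""
    # Slope of each segment: vertical offsets between curves are ignored.
    slopes = np.diff(curves, axis=1) / np.diff(grid)
    model = AgglomerativeClustering(n_clusters=n_clusters)  # assumed: default Ward linkage
    labels = model.fit_predict(slopes)
    # Centroid of each cluster: the mean of its constituent ICE curves.
    centroids = np.vstack([curves[labels == c].mean(axis=0) for c in range(n_clusters)])
    return labels, centroids

# Synthetic curves: 10 rising and 10 falling lines with random vertical offsets.
rng = np.random.default_rng(5)
grid = np.linspace(0.0, 1.0, 6)
offsets = rng.normal(size=(20, 1))
curves = np.vstack([offsets[:10] + grid, offsets[10:] - grid])

labels, centroids = cluster_ice_curves(curves, grid, n_clusters=2)
```

Clustering on raw points would group these curves by their offsets, whereas clustering on slopes separates the rising curves from the falling ones, which matches the motivation for preferring Slope Similarity over plain Euclidean distance.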
3.2 Generating Cluster Explanations
After clustering the ICE curves, we try to provide a human-interpretable explanation for each cluster of curves; that is, what do these clustered curves have in common that differentiates them from the rest of the ICE curves? To answer this question, we used a 1-deep decision tree to predict membership in that cluster against all other points (one-vs-all).
This simple model identifies the feature and split value that most reduces the entropy between the curves in the cluster and those outside of the cluster. Intuitively, this split represents a good explanation for what characteristics make the cluster unique.
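The one-vs-all decision stump can be sketched with scikit-learn as follows. The function `explain_cluster`, the returned tuple layout, and the synthetic data are hypothetical and serve only to illustrate how a feature and split value can be read from a depth-1 tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def explain_cluster(X, labels, cluster_id, feature_names):
    """Return (feature, direction, threshold) separating one cluster from the rest."""
    y = (labels == cluster_id).astype(int)  # one-vs-all membership target
    stump = DecisionTreeClassifier(max_depth=1, criterion="entropy", random_state=0)
    stump.fit(X, y)
    tree = stump.tree_
    if tree.node_count == 1:  # no split was made (e.g. the cluster is empty)
        return None
    feature = feature_names[tree.feature[0]]
    threshold = float(tree.threshold[0])

    def member_share(node):
        v = tree.value[node][0]  # per-class counts or fractions at this node
        return v[1] / v.sum()

    left, right = tree.children_left[0], tree.children_right[0]
    # The explanation points at the side of the split richer in cluster members.
    direction = "<=" if member_share(left) > member_share(right) else ">"
    return feature, direction, threshold

rng = np.random.default_rng(6)
X = rng.uniform(size=(200, 2))
labels = (X[:, 1] > 0.5).astype(int)  # cluster 1 is defined by the second feature
explanation = explain_cluster(X, labels, cluster_id=1, feature_names=["a", "b"])
```

The resulting predicate can be applied back to the dataset as a filter, which is how an explanation can later be checked against the cluster it describes.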
3.3 Merging Clusters
One difficulty with our method was in choosing the appropriate number of clusters for each dataset. We simulated exploratory data analysis (EDA) with early versions of the tool and found that some features would produce 5 or more distinct clusters of behavior (itself an interesting result), but that for other features, many of the cluster explanations would be duplicative, or nearly so (e.g. two clusters whose explanations use the same feature and nearly the same split). All else being equal, it is preferable to have fewer clusters so as to reduce the visual complexity of the chart and to allow users to focus on a few highly salient behaviors. To prune the list of explanations, we chose to implement a cluster merging operation, given the lack of any a priori indicator for the ideal number of clusters. In practice, we noticed that the accuracy of a merged cluster is usually higher than the mean accuracy of two clusters with similar explanations.
Merging was accomplished via the following process:
Here, each cluster’s explanation has a “feature” property (the feature used to define the split), a “direction” property (the comparison operator of the split) and a “value” property (e.g. 3), which together define a predicate. f is the feature for which the plot is being generated.
4 Implementation
Our algorithm was built in Python 2.7, using standard machine learning libraries, including NumPy, pandas, SciPy, and scikit-learn [43]. In addition, the original code for calculating PDP and ICE curves was forked from the PyCEBox library [3], though it has been heavily modified in our implementation. We also employed the sklearn-gbmi package [20] to calculate H-statistics. The charts in the paper were generated with Altair [51] and Matplotlib [24]. The VINE visual analytics system was built in HTML using D3.js [5]. It consumes a JSON file that is output by the Python script.
VINE initially presents the user with a feature space visualization designed to communicate the relevance of each feature to the model (see the feature space overview figure). VINE charts are presented as small multiples, one per feature. The X-axis indicates the strength of feature interactions. The Y-axis indicates the overall feature importance. This allows the user to quickly familiarize themselves with the dataset and its salient features. Both the charts themselves and their position in the feature space draw the eye to interesting patterns. For example, in the overview figure, Hour of Day is clearly the most important (topmost) feature, which can be verified by checking its Y-axis scale. Work Day, on the bottom right, is not a particularly important feature most of the time, but it does have one interaction (the red bar) which produces an outsize effect. Because this effect is so different from the PDP term, Work Day has a strong feature interaction score and occupies a sparse corner of the feature space. Wind Speed, on the other hand, has no VINE curves at all, and so appears on the left side of the graph.
The feature interaction strength (X-axis) is calculated as the sum of Dynamic Time Warp distances between each VINE curve and the PDP curve, normalized by the maximum value of the PDP curve. Feature importance (the Y-axis) is determined by the standard deviation of the PDP curve. The position should be taken as a rough approximation, as a force layout is used to prevent overplotting of the small multiples.
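The two coordinates can be sketched as below. The textbook dynamic-programming form of Dynamic Time Warp with an absolute-difference cost is an assumption, as are the function names `dtw_distance` and `feature_position` and the guard for a zero normalizer; the text above does not specify these details.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def feature_position(pdp, vine_curves):
    """Return (interaction strength, importance) for one feature's small multiple."""
    importance = float(np.std(pdp))  # Y-axis: spread of the PDP curve
    scale = float(np.max(pdp))       # normalizer: maximum value of the PDP curve
    total = sum(dtw_distance(c, pdp) for c in vine_curves)
    strength = total / scale if scale != 0.0 else 0.0  # assumed guard, not from the text
    return strength, importance

pdp = np.array([1.0, 2.0])
strength, importance = feature_position(pdp, [np.array([0.0, 1.0])])
no_curves = feature_position(pdp, [])  # a feature with no VINE curves
```

A feature without any VINE curves receives an interaction strength of zero under this sketch, which is consistent with such features appearing at the left edge of the feature space.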
Users can select a feature to enlarge the chart, which makes the explanations visible. VINE charts are displayed in the same manner as PDP and ICE plots. The VINE chart for feature A will have feature A’s range as the X-axis. The Y-axis depicts the change compared to the mean prediction. We chose to mean-center each plot to enable an additive interpretation, i.e. for a given data point, a user would sum the values from each plot to arrive at a prediction, rather than the traditional PDP which requires the values to be averaged. The partial dependence curve is presented as a black line. Each colored line represents a VINE cluster and is calculated as the centroid of all its constituent ICE curves (in other words, a partial dependence curve for the subset). The width of each VINE curve encodes the size of its cluster, but is log-scaled for readability purposes. Clicking on a VINE curve reveals all its constituent ICE curves. This allows the user to visually inspect the quality of each VINE curve.
Binary features are presented as bar charts instead of lines, to aid in their interpretation and visually distinguish them from numeric features. However, the underlying VINE algorithm is applied identically to each feature. The bar charts use the same color scheme as the line charts, with black corresponding to the PDP. A bar should be interpreted as the change in prediction incurred by increasing a feature from 0 to 1.
Lastly, the histograms on the right side provide a visual depiction of the explanation for each cluster. The histograms can be mapped to a VINE curve based on color. One or more columns will be displayed depending on the number of features that appear in explanations. In Figure 8, the Hour of Day feature serves as the best explanation for two of three VINE curves. The darker green region of the histogram conveys both the range defined by the explanation and the density of points in that region. The text of the definition, the size of the cluster and its accuracy are displayed in the top right-hand corner of the chart.
5 Evaluation
We evaluated the VINE algorithm on three benchmark tests, including an application of the Information Ceiling framework. Each test was performed on three datasets using a regression target and a single model fit for this task.
5.1 Datasets
VINE was evaluated on three tabular datasets with numerical, ordinal, and categorical features. Preprocessing consisted of one-hot encoding any categorical features. Ordinal features, such as Month for the Bike dataset, were left as is. These datasets did not have missing or erroneous values and so no imputation was performed. Due to the choice of a tree-based model, normalization/standardization was not necessary. The version of the Bike dataset stored in the UCI repository has several standardized features; these were transformed back to their original domain for readability purposes. Several features were removed from the Bike dataset [12] in order to produce a more intelligible model. Weekday and holiday were removed because they were raw versions of the engineered Workingday feature. Dteday and Month were removed for similar reasons, because they were better represented by the Season feature. The Casual and Registered variables were removed because they are alternate regression targets, and highly correlated with the Cnt target. Feature names for the Bike dataset have been changed to make them more human-readable for figures and use cases in this paper.

For all datasets, a Gradient Boosting Regressor was used. Each regressor used 300 trees and a minimum leaf size of 100 to prevent overfitting. The accuracy of each model is generally high and is reported in the accompanying table. Hyperparameters were manually selected to produce reasonably accurate models that were faithful to the underlying dataset, but beyond these basic measures, no attempts were made to identify an optimal model. For our purposes, the model is of more interest than the dataset or the relationship between the two. For this same reason, the entire dataset was used to fit the model, as there is no use for a test set in our evaluation. That said, VINE should perform similarly on unseen data as long as it was drawn from the same distribution as the data used to fit the model.

Note that for all three datasets, we chose a regression problem as our task. However, we believe that binary or multiclass classification problems can easily be tackled with VINE as well, as PDP and ICE curves are also suitable for these tasks. The major difference for these tasks is that the interpretation of the Y-axis changes from “change in prediction” to “change in probability of a given class”.
5.2 Comparison to Random Clustering Baseline
We first attempted to evaluate the efficacy of our algorithm for generating clusters and their corresponding explanations. We sought to ensure that our cluster explanations were accurate and that they outperformed a nominal baseline approach. The purpose of this check was to demonstrate that our decision tree would not simply overfit random clusters to a nonsensical explanation, and that a real signal must be present in the cluster in order for a high-accuracy explanation to be generated.
To evaluate the explanations, we compared the data points contained in each cluster (set A) with the data points returned by filtering the dataset on the cluster explanation (set B). By treating set A as a training set and set B as the model output, we were able to apply traditional accuracy, precision, and recall metrics. For this evaluation, we set the hyperparameter for number of clusters to 5.
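The comparison of set A and set B described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the paper: the function name `explanation_metrics`, the toy rows, and the example predicate are all hypothetical, and rows are identified by an assumed `id` field.

```python
def explanation_metrics(cluster_ids, matched_ids, all_ids):
    """Score a cluster explanation against the cluster it describes.

    cluster_ids: rows that actually belong to the cluster (set A).
    matched_ids: rows returned by filtering on the explanation (set B).
    all_ids:     every row in the dataset.
    """
    a, b, universe = set(cluster_ids), set(matched_ids), set(all_ids)
    tp = len(a & b)             # in the cluster and matched by the explanation
    fp = len(b - a)             # matched by the explanation but not in the cluster
    fn = len(a - b)             # in the cluster but missed by the explanation
    tn = len(universe - a - b)  # correctly left out
    accuracy = (tp + tn) / len(universe)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy data: (hour, workingday) pairs, not taken from the bike dataset.
rows = [
    {"id": i, "hour": h, "workingday": w}
    for i, (h, w) in enumerate([(8, 1), (9, 1), (17, 1), (8, 0), (13, 0), (22, 1)])
]
all_ids = {r["id"] for r in rows}
cluster = {0, 1, 3}  # set A: membership produced by clustering


def predicate(r):
    # Hypothetical cluster explanation used as a filter.
    return r["workingday"] == 1 and r["hour"] <= 18


matched = {r["id"] for r in rows if predicate(r)}  # set B
accuracy, precision, recall = explanation_metrics(cluster, matched, all_ids)
```

A randomly assembled cluster would be scored the same way, which is what makes the baseline comparison possible.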
To generate a baseline comparison, we used the following method to generate random clusters:
5.3 Correspondence to H-statistic Results
We believe that the explanations our method returns should be consistent with existing methods that quantify feature importance and feature interactions. Assuming that feature A interacts strongly with features B, C, and D according to a measure such as the H-statistic [16] or Greenwell’s partial dependence interaction [18], we expect cluster explanations for feature A to include features B, C, and/or D. Other features may be included as well, because an interaction may only be strong in a narrow range.
To this end, we evaluate our cluster explanations using Friedman’s H-statistic, which generates a score between 0 and 1 for each pair of features. 0 indicates no interaction between the two features, and 1 indicates that the features have no main effects, but rather that their entire impact on the prediction is generated from their interaction. For a given Feature A, we generate a list of features that appear in its cluster explanations (list A). We compare list A against the list of feature interactions, ordered by the H-statistic (list B).
Given that one of the issues with the H-statistic is the lack of a well-established threshold for determining significance, we chose to ignore the values themselves and instead calculate the number of elements from list A that appear in the top 3 features of list B. We then sum this count across all features in the model, and normalize it by the total number of clusters generated by VINE. The result can be interpreted as the percentage of explanations that utilize a strongly interacting feature. We also present the baseline probability that features would have appeared among the top 3 interactors if they were chosen at random (this probability is constant for each dataset).
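A minimal sketch of this counting procedure is given below. It assumes that list A holds one explanation feature per cluster and that the pairwise H-statistic values have already been computed elsewhere; the function name `top_k_overlap_rate` and all numbers are hypothetical.

```python
def top_k_overlap_rate(explanation_features, h_stats, k=3):
    """Share of cluster explanations that use a top-k interacting feature.

    explanation_features: {feature: [feature used by each cluster explanation]}
                          (list A, assumed to hold one entry per cluster).
    h_stats:              {feature: {other_feature: H-statistic value}}.
    """
    hits = 0
    n_clusters = 0
    for feature, used in explanation_features.items():
        # List B: the other features, ordered by interaction strength.
        ranked = sorted(h_stats[feature], key=h_stats[feature].get, reverse=True)
        top_k = set(ranked[:k])
        hits += sum(1 for f in used if f in top_k)
        n_clusters += len(used)
    return hits / n_clusters if n_clusters else 0.0


# Hypothetical values for illustration only.
explanation_features = {
    "hour": ["workingday", "temp", "workingday"],
    "temp": ["season", "wind"],
}
h_stats = {
    "hour": {"workingday": 0.4, "temp": 0.1, "season": 0.05, "humidity": 0.2, "wind": 0.02},
    "temp": {"hour": 0.1, "season": 0.3, "humidity": 0.25, "wind": 0.01, "workingday": 0.05},
}
rate = top_k_overlap_rate(explanation_features, h_stats)
```

Only the ranking of the H-statistic values is used, which sidesteps the question of what threshold counts as a significant interaction.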
5.4 Information Ceiling
We introduce a novel framework, the Information Ceiling, for evaluating the fidelity of any visual model explanation to its underlying model. For our tabular regression problems presented here, the metric simply consists of the $R^2$ (often known as the Coefficient of Determination) between the model’s predictions and our algorithm’s predictions as it tries to simulate the human sensemaking process afforded by the model visualization in question. The tricky part here is to describe and systematize a process by which a consumer of a visualization would use it to make a prediction. Nonetheless, as this is one of the most common human tasks used to evaluate visualizations [13], we argue that it behooves the designer of model visualizations to build them according to standard human-computer interaction principles, with specific tasks in mind.
Luckily, for VINE curves and other plots in the PDP family, a fairly simple method presents itself for making predictions based on the explanation. For the PDP, the chart for Feature A allows the user to identify the value contributed to the prediction at any point on the X-axis (i.e. the range of Feature A). To find this component of a prediction for instance X, a user simply has to find instance X’s value for Feature A, find that point on the X-axis, and follow it up to the PDP line. This will yield Feature A’s contribution to the prediction. The user can simply sum the results of this process for each feature in the dataset, add the sum to the mean value of the target variable, and yield a prediction based on the PDP curve. This process is summarized in Figure 4.
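The lookup-and-sum procedure can be sketched as below. The sketch assumes centered PDP curves stored as sorted grid points with linear interpolation between them (the interpolation is our assumption; the text only says to read the value off the line). The names `read_curve` and `pdp_predict` and all numbers are hypothetical.

```python
from bisect import bisect_left


def read_curve(xs, ys, x):
    """Read a curve at x by linear interpolation, clamping outside the grid.

    xs must be sorted in increasing order; ys holds the curve height at each x.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    t = (x - x0) / (x1 - x0)
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def pdp_predict(instance, centered_pdps, target_mean):
    """Mean of the target plus the sum of each feature's centered PDP contribution."""
    contributions = (
        read_curve(xs, ys, instance[feature])
        for feature, (xs, ys) in centered_pdps.items()
    )
    return target_mean + sum(contributions)


# Hypothetical centered PDP curves: {feature: (grid values, change from the mean)}.
centered_pdps = {
    "hour": ([0, 12, 23], [-50.0, 30.0, -20.0]),
    "temp": ([0, 20, 40], [-40.0, 0.0, 40.0]),
}
prediction = pdp_predict({"hour": 6, "temp": 30}, centered_pdps, target_mean=190.0)
```

Because the curves are centered, each lookup returns a change relative to the average prediction, which is why the contributions can be added to the target mean.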
While it is unlikely that a user would perform this exact task in practice, a heuristic version is more likely. A user would notice that an instance of interest has high values for Features A, C, and D, and remember that the PDP curved sharply upwards for Features A and C. The user would add some estimated amount to an average value for the target, and produce a prediction in this manner. This method is recommended in [23] as a workflow for data scientists when using partial dependence plots to analyze a model.
This method can easily be extended to ICE and VINE plots, as summarized in Figure 4. For ICE curves, the user simply selects the particular curve for the instance of interest instead of a PDP line. For VINE, they select (much more easily) the VINE curve whose predicate matches their instance. For VINE, two edge cases must be considered: (1) when a point matches 2 or more predicates, we take the mean of each of their predictions, and (2) when a point doesn’t match any predicate, we use the PDP line for prediction instead.
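The VINE selection rule and its two edge cases can be sketched as follows. For brevity, curves are represented here as plain Python callables and predicates as functions of the instance; both representations, the name `vine_contribution`, and the toy curves are assumptions made for illustration.

```python
def vine_contribution(instance, feature, vine_curves, pdp_curve):
    """Contribution of one feature, read from VINE curves with a PDP fallback.

    vine_curves: list of (predicate, curve) pairs, where predicate(instance)
                 returns True or False and curve(x) returns the curve height at x.
    pdp_curve:   callable used when no predicate matches.
    """
    x = instance[feature]
    matches = [curve(x) for predicate, curve in vine_curves if predicate(instance)]
    if matches:
        # One match uses that curve directly; two or more matches
        # (edge case 1) are averaged.
        return sum(matches) / len(matches)
    # Edge case 2: no predicate matches, so fall back to the PDP line.
    return pdp_curve(x)


# Hypothetical curves for the "hour" feature.
vine_curves = [
    (lambda r: r["workingday"] == 0, lambda h: -30.0),
    (lambda r: r["temp"] > 25, lambda h: 40.0 if h >= 17 else 0.0),
]


def pdp_curve(h):
    return 10.0


both = vine_contribution({"hour": 18, "workingday": 0, "temp": 30}, "hour", vine_curves, pdp_curve)
none = vine_contribution({"hour": 18, "workingday": 1, "temp": 10}, "hour", vine_curves, pdp_curve)
one = vine_contribution({"hour": 8, "workingday": 0, "temp": 10}, "hour", vine_curves, pdp_curve)
```

The ICE variant differs only in the selection step: the curve is looked up by the instance's own identifier instead of by a predicate.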
This method is easy for an algorithm to simulate when presented with the data that underlies each of the curves. It should be noted that we do not expect any user to derive predictions as accurately as our algorithm can. Instead, we treat our metric as the upper limit on prediction fidelity (or a lower bound on error) that could possibly be achieved by interpreting the visualization in this way. For this reason, we refer to this evaluation framework as the Information Ceiling.
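As a rough sketch of the final scoring step, the Coefficient of Determination can be computed with the model's predictions playing the role of the reference values and the predictions simulated from a visualization playing the role of the estimates. The function name `r_squared` and every number below are made up for illustration; they are not results from this paper.

```python
def r_squared(model_preds, simulated_preds):
    """Coefficient of Determination of simulated predictions against model predictions."""
    mean = sum(model_preds) / len(model_preds)
    ss_res = sum((m - s) ** 2 for m, s in zip(model_preds, simulated_preds))
    ss_tot = sum((m - mean) ** 2 for m in model_preds)
    return 1.0 - ss_res / ss_tot


# Made-up toy numbers, not results from the paper.
model_preds = [100.0, 200.0, 300.0, 400.0]
pdp_simulated = [150.0, 200.0, 300.0, 350.0]
vine_simulated = [110.0, 190.0, 310.0, 390.0]

pdp_ceiling = r_squared(model_preds, pdp_simulated)
vine_ceiling = r_squared(model_preds, vine_simulated)
```

A higher value means that the visualization, read in the way described above, carries more of the information needed to reproduce the model's output.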
6 Results
We report performance on three algorithmic benchmark tests across three datasets. Each of the benchmarks was devised specifically for this paper. A Jupyter notebook with the full code required to reproduce all results, charts, and tables in this publication is available at https://www.github.com/MattJBritton/VINE. Instructions, code, datasets, and other files necessary to run VINE as a standalone tool are also available at this URL.
6.1 Comparison to Random Clustering Baseline
Figure 5 indicates that VINE cluster explanations more accurately describe real subsets than randomly chosen subsets. We take this as evidence that VINE explanations detect real descriptions of subsets, and do not simply fit noise.
6.2 Correspondence to H-statistic Results
Table 2 presents the results of the H-statistic experiment. The feature used in VINE explanations occurs in the top 3 interactors (sorted by H-statistic) about twice as often as we would expect it to if features were selected randomly. This suggests that the VINE algorithm successfully captures feature interactions. Note that the H-statistic calculation (and the random baseline) are nondeterministic, so results will vary across iterations. Results for one pass are reported.
6.3 Information Ceiling
Our Information Ceiling method shows that VINE curves have higher fidelity to the model than PDPs (see Figure 6). In addition, our method outperformed Individual Conditional Expectation plots in two of the three datasets. We conclude that our method can be considered a more accurate representation of a model’s behavior than PDPs. In addition, it appears that the method for calculating individual conditional expectation has fundamental limitations, which may be caused by the aforementioned issues with extrapolation. Even when a prediction is generated for a data point based on its own ICE curve, the prediction is scarcely better than the PDP line (for two of the three datasets). We hypothesize that when VINE aggregates ICE curves, it averages out instabilities, which is ample tradeoff for the loss in specificity.
7 Discussion
7.1 Contributions
Our contribution consists of (1) an algorithm that clusters ICE curves based on shape similarity and generates a human-readable label for that subset, (2) a visual analytics tool that facilitates model interpretation and sensemaking using VINE explanations, and (3) a framework for evaluating visual explanations of machine learning models based on the loss that an automated method incurs when using them as a basis for prediction.
7.2 Strengths
(1) Our algorithm is completely model-agnostic. The only requirement is that the model’s prediction function be passed into the export method and that this prediction function uses the same API as scikit-learn [43].
(2) VINE curves extract salient feature interactions and give detailed information about how they affect predictions. Identifying these feature interactions is as simple as reading the chart, and does not require a detailed statistical analysis.
(3) The Information Ceiling framework allows us to compare the validity of multiple visualizations in the partial dependence family for the first time.
7.3 Limitations
(1) Our approach is currently limited to tabular data and does not work for text, image, or video data.
(2) Our approach works best when most features in a dataset are numerical. Ordinal, categorical, and Boolean features are supported, but existing methods [29, 30, 28, 49] are better adapted to this task. In particular, one-hot encoding a categorical variable or creating a vectorized text representation can create a confusing array of features.
(3) Large datasets (50,000 rows or more) will take at least several minutes to compute and may use a large amount of memory.
7.4 Potential Use Cases
Our tool extracts model behavior that differs significantly from the mean feature effect (the partial dependence curve). This has enormous potential value for debugging both the model and the training set. We used an early version of our tool to perform some model debugging on the Month attribute of the bike dataset. Data cases with a Season of Spring and a month from July to December had markedly lower predicted ridership than the PDP average. However, this combination of features (Spring in December) is impossible. A data scientist could take this insight and either build a validation rule for data intake, or more likely drop one of the two highly correlated features.
Another use case is the extraction of insights. Our tool can partially automate or supplement exploratory data analysis. In Figure 7, an analyst viewing the Hour feature in the bike dataset would note that VINE has found two regions of interest. The blue VINE curve is for weekends/holidays (Workingday=0). The PDP curve shows a large bump in ridership at the morning and evening rush hours. However, for weekends, this effect is far less pronounced, with ridership increasing steadily but less sharply. The insight that weekday and weekend ridership patterns are fundamentally different is presumably valuable for a bike-sharing company. These insights are extracted without the user being aware of the importance of the Workingday feature or making any intentional effort to analyze it.
Figure 7 visualizes the effect of the Feels Temperature (temperature + wind chill) on ridership. The PDP curve indicates that the model predicts a large spike in ridership around 75 degrees. However, the VINE curves reveal a more nuanced story. Late afternoon and evening ridership (the blue curve) spikes higher, while early afternoon and morning ridership (the red curve) stays mostly flat until the temperature becomes very hot. Moreover, the green curve indicates that on warm winter days, ridership spikes particularly high and at a lower temperature. A model explanation communicated with only the PDP curve might convince the bike-sharing company to reduce the size of the fleet during the winter. VINE flags the potential for highly profitable warm winter days. It is unlikely that this correlation would have been discovered unless someone thought to check for it explicitly.
7.5 Evaluation Framework
We believe that the Information Ceiling metric can be used to validate the effectiveness of a wide array of visualizations in the interpretable ML space. While we only consider PDP, VINE, and ICE plots here, it would be trivial to compare ALE plots too. Commentators [41] have noticed that whereas PDP plots suffer from extrapolation into sparse areas of the conditional distribution, ALE plots can suffer from a related tradeoff between accuracy and stability when setting the hyperparameter for number of intervals. An easy way to determine the superior method is to evaluate them using our method, which quantifies fidelity to a model.
Beyond visualizations in the partial dependence family, our Information Ceiling framework could also be used to evaluate explanations such as RuleMatrix [40], in which the algorithm would simply scan through rules in the order they are presented in the visualization until it found a matching predicate for a given instance. Similarly, Gamut [21] or other GLMs can be evaluated in much the same way as PDPs, essentially using a feature plot as a lookup table for each instance and then adding predictions together.
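The rule-scanning procedure suggested for rule-based explanations could look roughly like the sketch below. It does not use RuleMatrix itself: the rule representation (an ordered list of predicate and output pairs), the default output, and the name `first_match_predict` are all hypothetical.

```python
def first_match_predict(instance, ordered_rules, default):
    """Return the output of the first rule whose predicate matches the instance.

    ordered_rules: list of (predicate, output) pairs in display order.
    default:       output used when no rule matches (an assumption of this sketch).
    """
    for predicate, output in ordered_rules:
        if predicate(instance):
            return output
    return default


# Hypothetical rules for illustration only.
ordered_rules = [
    (lambda r: r["temp"] > 30, "high"),
    (lambda r: r["hour"] < 6, "low"),
]
first = first_match_predict({"temp": 35, "hour": 3}, ordered_rules, default="medium")
second = first_match_predict({"temp": 20, "hour": 3}, ordered_rules, default="medium")
fallback = first_match_predict({"temp": 20, "hour": 12}, ordered_rules, default="medium")
```

The simulated predictions would then be compared with the model's own predictions using whatever fidelity measure suits the task.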
Pushing the envelope further, it should be possible to evaluate LIME [44], creating a direct comparison between global and local explanations for the first time. One approach, based on our personal model interpretation workflows, is as follows: (1) generate k cluster centroids using a method such as kmedoids, (2) build a LIME model for each centroid, and (3) make a prediction for an instance by finding the nearest centroid and using its LIME model. Clearly, there are many undefined parameters here, such as the value k or the distance metric to use. It is likely that the human sensemaking process for this task is difficult to replicate as an algorithm. However, we argue that there is value in investigating this process, under the assumption that it will not be possible to design a good model visualization (or indeed, any visualization) without a sense of its intended use.
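Step (3) of this procedure might be sketched as below. The sketch does not call LIME or any clustering library: it assumes the centroids from step (1) and one local linear surrogate per centroid from step (2) are already available as plain tuples, and it uses Euclidean distance as a placeholder for the undefined distance metric. All names and numbers are hypothetical.

```python
import math


def nearest_centroid_predict(instance, centroids, local_models):
    """Predict with the local linear surrogate attached to the nearest centroid.

    instance:     tuple of numeric feature values.
    centroids:    list of tuples, one per cluster (step 1, computed elsewhere).
    local_models: list of (intercept, coefficients) pairs, one per centroid
                  (step 2, standing in for the local explanation models).
    """
    nearest = min(
        range(len(centroids)),
        key=lambda i: math.dist(instance, centroids[i]),  # Euclidean, by assumption
    )
    intercept, coefs = local_models[nearest]
    return intercept + sum(c * x for c, x in zip(coefs, instance))


# Hypothetical centroids and surrogates for illustration only.
centroids = [(0.0, 0.0), (10.0, 10.0)]
local_models = [(1.0, (2.0, 0.0)), (5.0, (0.0, -1.0))]

near_first = nearest_centroid_predict((1.0, 2.0), centroids, local_models)
near_second = nearest_centroid_predict((9.0, 8.0), centroids, local_models)
```

Varying the number of centroids or the distance metric in such a sketch is one way to probe how sensitive the resulting fidelity score is to these undefined parameters.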
It should be stressed that we do not recommend evaluating explanations solely by our method. Our method is not capable of measuring the aesthetic value or ease of interpretation of an explanation, only its information content. We believe our Information Ceiling framework can instead set a ceiling on the understanding that a human can glean from an explanation. It should be noted that this is not a new concern in information visualization: prior research has investigated the fidelity of visualization-generating algorithms such as t-SNE [53] or even histograms [36] to the underlying data.
Other research [42] has investigated the design and perceptual factors involved in model visualization. This research is a necessary complement to our work, addressing how to “make the most” of the Information Ceiling that a visual model explanation affords. Performance on any prediction task performed by a human can be compared directly to the Information Ceiling, and the loss can be explained by either design issues with the visualization, or human perceptual limitations (e.g. people have been shown to exaggerate certain features of line charts and downplay or excise others [37]).
8 Future Work

Investigate the effectiveness of partial dependence across datasets. While it cannot be proven from this limited study, the far higher performance of the ICE plot on the Diabetes dataset suggests that the fidelity of partial dependence curves may be contingent on some unknown property of a dataset, such as the presence of multicollinearity. The Information Ceiling provides an ideal tool to probe the limitations of partial dependence methods and the impact of violating their assumptions. Given the wide use of the technique, this meta-learning could be valuable.

Evaluation. We present a novel evaluation framework for model visualizations in which we seek to quantify their information content. We hope that this method can be used both to evaluate the fundamental validity of other techniques in interpretable ML and to guide future studies into human sensemaking with predictive models. We also hope that VINE can be evaluated in situ to determine its utility for data scientists.
9 Conclusion
We present VINE, an interactive visualization that communicates regional explanations for models built to make predictions on tabular data. Our approach leverages existing work into partial dependence plots to derive groups of instances whose behavior in a given model significantly differs from the mean feature effect. These regional explanations also capture feature interaction effects in a novel manner. We argue that our approach provides a useful complement to global explanations by identifying caveats to the main behaviors, and also complements local explanations by aggregating similar instances and distilling common contributors to their prediction. Our approach has applicability to model interpretation and explanation, model debugging, algorithmic fairness and adversarial artificial intelligence.
We demonstrate that our algorithm produces explanations that have high mean accuracy in describing relevant subsets. We also provide example use cases that demonstrate how VINE can facilitate model exploration on a real dataset. We evaluate VINE against PDPs using a novel evaluation framework (Information Ceiling) and find that VINE more faithfully replicates the predictions made by the model. We conclude by discussing ways in which the Information Ceiling approach can be used to quantify the effectiveness of visualizations in the interpretable ML space.
Acknowledgements.
The authors wish to thank Fred Hohman and Andrea Hu.
References
 [1] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.
 [2] D. W. Apley. Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468, 2016.
 [3] R. Austin. Pycebox, Jan 2018.
 [4] P. Biecek. Dalex: Explainers for complex predictive models in r. Journal of Machine Learning Research, 19(84):1–5, 2018.
 [5] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011.
 [6] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [7] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM, 2015.
 [8] G. Casalicchio, C. Molnar, and B. Bischl. Visualizing the feature importance for black box models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 655–670. Springer, 2018.
 [9] C. Chen, K. Lin, C. Rudin, Y. Shaposhnik, S. Wang, and T. Wang. An interpretable model with globally consistent explanations for credit risk. arXiv preprint arXiv:1811.12615, 2018.
 [10] G. F. Cooper, V. Abraham, C. F. Aliferis, J. M. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, J. E. Janosky, G. Livingston, T. Mitchell, et al. Predicting dire outcomes of patients with community acquired pneumonia. Journal of biomedical informatics, 38(5):347–366, 2005.
 [11] Council of European Union. Council regulation (EU) no 2016/679, 2016. https://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1528874672298&uri=CELEX%3A32016R0679.
 [12] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
 [13] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
 [14] A. Fisher, C. Rudin, and F. Dominici. Model class reliance: Variable importance measures for any machine learning model class, from the “Rashomon” perspective. arXiv preprint arXiv:1801.01489, 2018.
 [15] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.
 [16] J. H. Friedman, B. E. Popescu, et al. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.
 [17] A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.
 [18] B. M. Greenwell, B. C. Boehmke, and A. J. McCarthy. A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755, 2018.
 [19] H2O.ai. Python interface for H2o3, Mar 2019. 3.10.08.
 [20] R. Haygood. sklearn-gbmi, Jan 2017.
 [21] F. Hohman, A. Head, R. Caruana, R. DeLine, and S. M. Drucker. Gamut: A design probe to understand how data scientists understand machine learning models. CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), to appear, 2019.
 [22] F. M. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 2018.
 [23] G. Hooker. Discovering additive structure in black box functions. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 575–580. ACM, 2004.
 [24] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55
 [25] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. P. Chau. Activis: Visual exploration of industry-scale deep neural network models. IEEE transactions on visualization and computer graphics, 24(1):88–97, 2018.
 [26] M. Kahng, D. Fang, and D. H. P. Chau. Visual exploration of machine learning results using data cube analysis. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 1. ACM, 2016.
 [27] B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, pp. 2280–2288, 2016.
 [28] J. Krause, A. Dasgupta, J. Swartz, Y. Aphinyanaphongs, and E. Bertini. A workflow for visual diagnostics of binary classifiers using instance-level explanations. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 162–172. IEEE, 2017.
 [29] J. Krause, A. Perer, and E. Bertini. Using visual analytics to interpret predictive machine learning models. arXiv preprint arXiv:1606.05685, 2016.
 [30] J. Krause, A. Perer, and E. Bertini. A user study on the effect of aggregating explanations for interpreting machine learning models. 2018.
 [31] J. Krause, A. Perer, and K. Ng. Interacting with predictions: Visual inspection of black-box machine learning models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5686–5697. ACM, 2016.
 [32] T. Laugel, M.-J. Lesot, C. Marsala, X. Renard, and M. Detyniecki. Inverse classification for comparison-based interpretability in machine learning. arXiv preprint arXiv:1712.08443, 2017.
 [33] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
 [34] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 623–631. ACM, 2013.
 [35] S. M. Lundberg, G. G. Erion, and S.-I. Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
 [36] A. Lunzer and A. McNamara. Exploring histograms. http://tinlizzie.org/histograms/, 2017. Accessed: 2019-03-29.
 [37] M. Mannino and A. Abouzied. Qetch: Time series querying with expressive sketches. In Proceedings of the 2018 International Conference on Management of Data, pp. 1741–1744. ACM, 2018.
 [38] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230. ACM, 2013.
 [39] T. Miller. Explanation in artificial intelligence: insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
 [40] Y. Ming, H. Qu, and E. Bertini. Rulematrix: Visualizing and understanding classifiers with rules. IEEE transactions on visualization and computer graphics, 25(1):342–352, 2019.
 [41] C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.
 [42] M. Narayanan, E. Chen, J. He, B. Kim, S. Gershman, and F. Doshi-Velez. How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1802.00682, 2018.
 [43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [44] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. ACM, 2016.
 [45] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.
 [46] C. Rudin. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154, 2018.
 [47] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
 [48] E. Štrumbelj and I. Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems, 41(3):647–665, 2014.
 [49] P. Tamagnini, J. Krause, A. Dasgupta, and E. Bertini. Interpreting black-box classifiers using instance-level visual explanations. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, p. 6. ACM, 2017.
 [50] B. Ustun and C. Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, Mar 2016. doi: 10.1007/s10994-015-5528-6
 [51] J. VanderPlas, B. Granger, J. Heer, D. Moritz, K. Wongsuphasawat, A. Satyanarayan, E. Lees, I. Timofeev, B. Welsh, and S. Sievert. Altair: Interactive statistical visualizations for python. Journal of Open Source Software, dec 2018. doi: 10.21105/joss.01057
 [52] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244, 1963.
 [53] M. Wattenberg, F. Viégas, and I. Johnson. How to use t-SNE effectively. Distill, 2016. doi: 10.23915/distill.00002
 [54] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. In ACM Sigmod Record, vol. 25, pp. 103–114. ACM, 1996.