VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees

by Angelos Chatzimparmpas, et al.

Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forests and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. Finally, we evaluate the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study.






1 Introduction

Ensemble learning (EL) [Zhou2009Ensemble] is a well-established area of machine learning (ML) that strives for better performance by merging the predictions from various ML models. Three prominent methods for building ensembles are [Sagi2018Ensemble]: bagging [Breiman1996Stacked], boosting [Freund1996Experiments, Schapire1990Strength], and stacking [Wolpert1992Stacked]. Bagging requires training many decision trees on separate groups of instances of a data set and taking the average of their predictions [Breiman1996Stacked]. Boosting includes attaching weak classifiers (e.g., decision stumps or shallow decision trees) sequentially, each improving the predictions made by the previous models [Freund1996Experiments, Schapire1990Strength]. Stacking involves fitting many base models from different algorithms on the same data set and using a metamodel to combine their results [Wolpert1992Stacked]. The common ground between bagging and boosting methods is that they incorporate ML algorithms that produce numerous decision trees [Kingsford2008Decision], such as random forests (RF) [Breiman2001Random] and adaptive boosting (AB) [Freund1999A], respectively. The decision paths stemming from bagged or boosted decision trees are the target of the visual analytics (VA) approach proposed in this paper.

The popularity of RF and AB is confirmed by their success in solving typical supervised classification problems, which constitute the majority of problems in the real world [Opitz1999Popular, Wyner2017Explaining]. An in-depth study [Delgado2014Do] that estimates the performances of 179 algorithms of various types concludes that bagged decision trees of RF are better than other (types of) algorithms, such as deep learning approaches. Despite their remarkable predictive power, a crucial concern for algorithms that generate many decision trees is interpretability. Breiman [Breiman2001Statistical], for instance, indicates that RF models, while superb predictors, receive a low rating regarding their interpretability. As ML models can provide incorrect predictions [Caruana2015Intelligible], ML experts have to check whether the model functions properly [Tam2017An]. Also, domain experts in critical fields need to understand how a specific prediction has been reached in order to trust ML [Zhou20182D]. For example, in medicine, a physician might not rely on a model without explanations of how and why it forms a prediction, since patient lives are at risk [Ribeiro2016Why, Hastie2001The, Lakkaraju2016Interpretable]. Or, in the financial domain, declined decisions for loan applicants require additional transparency with the precise justification of the outcome [Sachan2020An]. Thus, one research question that remains open is: (RQ1) How do the rules learned by bagged decision trees differ from those of boosted decision trees, and is there any potential benefit in combining them, regarding interpretability and predictive performance?

The interpretation of ML models typically happens either at a global or a local level [Kopitar2019Local]. Global approaches intend to explain the ML model as a whole [Lipton2018The], assisting domain experts in exploring the general impact of each decision and gaining confidence in the produced predictions. On the other hand, local approaches aim to provide case-based reasoning [Du2019Techniques, Carvalho2019Machine], allowing domain experts to review a prediction and trace its decision path in order to conclude if the decision rule, and consequently the prediction, is trustworthy [Weller2019Transparency]. Nevertheless, comparing numerous alternative decision paths without the support of an intelligent system is a time-consuming and resource-heavy procedure. For example, rapidly scanning the list of test instances and investigating specific instances of interest from multiple perspectives (e.g., outliers and borderline cases) can be crucial [Kim2014The]. One research question that arises from these explanations, inspired by Streeb et al. [Streeb2021Task], is: (RQ2) How can visualizations and VA tools/systems facilitate the externalization of domain knowledge?

In this paper, we present VisRuler (see the tool overview figure), a VA tool that addresses the research questions described above by supporting the exploratory combination of decisions from two closely related ML algorithms (i.e., RF and AB). VisRuler uses validation metrics for picking performant and diverse models and combines the decision paths from bagged and boosted trees to extract insightful and interpretable rules. Our contributions consist of the following:

  • a visual analytic workflow for defining a methodical way of evaluating decisions (cf. Figure 1 described in Section 4);

  • a prototype VA tool, called VisRuler, that applies the suggested workflow with coordinated views that support the joint effort between ML experts and domain experts for extracting rules and making decisions, respectively;

  • a use case and a usage scenario, applying real-world data, that validate the effectiveness of utilizing both bagged and boosted decision trees at the same time; and

  • a user study that showed promising results.

2 Related Work

According to a recent survey [Streeb2021Task] that has extensively analyzed tree- and rule-based classification, several VA systems have been developed for this topic in the InfoVis and VA communities. However, most of these tools do not employ algorithms and measures (except for the accuracy metric) in order to compare model quality [Streeb2021Task]. This section reviews prior work on the interpretation of bagged and boosted decision trees and the more general tools for tree- and rule-based visualization, comparing them with VisRuler to highlight our tool’s novelty.

Interpretation of Bagged Decision Trees.

As in VisRuler, relevant works that utilize bagging methods use the RF algorithm to produce decision trees [Neto2021Explainable, Nsch2019Colorful, Zhao2019iForest, Neto2021Multivariate]. iForest [Zhao2019iForest] provides users with tree-related information and an overview of the involved decision paths for case-based reasoning, with the goal of revealing the model’s working internals. However, iForest can be used only for binary classification, while VisRuler can be used with multi-class data sets (as in the use case of Section 4). Also, the feature flow, a node-link diagram, suffers from scalability issues (a challenge only partially overcome with aggregation). Our tool’s approach with dimension reduction employed for clustering all decisions extracted by multiple models enables users to gain insights into the relation of large quantities of rules. Therefore, VisRuler allows users to mine rules for both a particular class outcome and in connection to a specific case. ExMatrix [Neto2021Explainable] is another VA tool for RF interpretation that operates using a matrix-like visual representation, facilitating the analysis of a model and connecting rules to classification results. While the scalability is good, it does not cover the task of finding similarities between decisions from diverse models and algorithms. In conclusion, none of the above works have experimented with the fusion of bagged and boosted decision trees, and in particular, with visualizing both tree types in a joint decision space to observe their dissimilarity, which can result in unique and undiscovered decisions.

Interpretation of Boosted Decision Trees.

Special attention has been given to boosted decision trees with VA tools for diagnosing the training process of boosting methods [Liu2018Visual, Huang2019GBRTVis, Wang2021Investigating] and interpreting their decisions [Xia2021GBMVis]. Closer to our work, GBMVis [Xia2021GBMVis] aims to reveal the structure and properties of gradient boosting [Friedman2001Greedy], enabling users to examine the importance of features and follow the data flow for different decisions. Its node-link diagram may limit its scalability when monitoring hundreds or thousands of decisions concurrently, as opposed to VisRuler. Furthermore, our novel parallel coordinates plot adaptation allows users to instantly combine rules and observe their differences to identify unique decisions. BOOSTVis [Liu2018Visual] employs views such as a temporal confusion matrix visualization for verifying the performance changes of the model, a t-SNE [vanDerMaaten2008Visualizing] projection for inspecting the instances, and a node-link diagram for examining the rules. Through GBRTVis [Huang2019GBRTVis], users can explore gradient boosting [Friedman2001Greedy] with a node-link diagram for the rules, a treemap for the distribution of instances, and continuous monitoring of the loss function.

VISTB [Wang2021Investigating] contains a redesigned temporal confusion matrix to track the per-instance prediction during the training process. It also enables the comparison of the impact of individual features over iterations. These VA systems focus on the online training of boosting methods and aim to assist in feature selection and hyperparameter tuning. While these problems are (partially) tackled by our tool, we concentrate on interpreting the decisions from bagged and boosted decision trees and comparing them across models.

Tree- and Rule-based Model Visualization.

Existing work on single decision tree visualization has experimented with different visualization techniques, such as node-link diagrams [Elzen2011BaobabView, Nguyen2000A, Lee2016An, Cavallo2019Clustrophile, Barlow2001Case, Phillips2017FFTrees, Bremm2011Interactive, Hongzhi2004Multiple, Munzner2003TreeJuxtaposer, Behrisch2014Feedback], treemaps [Muhlbacher2018TreePOD, Gomez2013Visualizing], icicle plots [Padua2014Interactive, Ankerst2000Towards], star coordinates [Teoh2003Starclass, Teoh2003PaintingClass], and 2D scatter-plot matrices [Do2007Towards]. These techniques do not generalize well when exploring multiple decision trees, which is VisRuler’s primary design goal. Visualizing surrogate models that approximate the behavior of the original models, either globally or locally, is another branch of related work [Cao2020DRIL, Castro2019Surrogate, Han2000RuleViz, Agus2021RISSAD, Ware2001Interactive, Yuan2021An, Eisemann2014A]. Rule-based visualizations have also been deployed for the interpretation of complex neural networks [Marcilio2021ExplorerTree, Ming2019RuleMatrix, Jia2020Visualizing]. Nevertheless, these models differ due to the lack of inherent decisions that could be extracted directly, as is possible with bagged and boosted decision trees. The core mechanism of bagging and boosting methods is the generation of decisions based on the training data, which experts can then interpret.

Finally, several VA tools have been developed for specific domains of research, such as medicine [Hummelen2010Deep, Viros2008Improving, Niemman2014Learning, Carlson2008Phylogenetic], biology [Abramov2019RuleVis, Sydow2014Structure], security [Aupetit2016Visualization], and social sciences [Moussaid2013Social]. However, VisRuler is a model-agnostic solution that could be modified to work with various domains, depending on the given data set and the domain expert.

3 Target Groups and Design Goals

In the InfoVis/VA communities, most of the research in explainable ML focuses on assisting ML experts and developers in understanding, debugging, refining, and comparing ML models [Chatzimparmpas2020A, Chatzimparmpas2020The]. In this paper, we expand our method to involve another target group: the various domain experts affected by the ML progress in fields such as finance, social care, and health care. With the growing adoption of ML in different areas, domain experts with little knowledge of ML algorithms might still want (or be required) to use them to assist in their decision-making. On the one hand, their trust in such decisions could be low due to a lack of in-depth knowledge on how models are learning from the training data. On the other hand, ML experts often have little prior knowledge about the data from particular domains. Thus, the primary goal of VisRuler is to combine the best of both worlds, i.e., to offer a solution that combines the benefits from both expert groups.

Our design goals (G1–G5) originate from the analysis of the related work in Section 2, especially the three design goals from Zhao et al. [Zhao2019iForest] and the four questions from Ming et al. [Ming2019RuleMatrix]. Also, our experience from the development of VA tools [Chatzimparmpas2021StackGenVis, VisEvol2021VisEvol] for constructing powerful and diverse ML ensembles played a vital role. The implementation of the following design goals is described in Section 4.

G1: Comparison of performance and architecture of models for selecting the most effective ones. The comparison between models should be supported with various measurements, as follows: (1) illustrate the performance of each model based on multiple validation metrics; (2) distill the number of false-positive and false-negative instances from the confusion matrix for every model; and (3) derive the number of decision trees and decision paths (or simply decisions) per model, to compare their structure.

G2: Investigation of the contribution of global features according to different models and algorithms. Following the preceding goal, users should be guided through the process of selecting important features. Thus, it is crucial to enable the comparison between per-algorithm and per-model feature importances.

G3: Exploration of alternative clusters of decisions for global explanation and case-based reasoning. The summarization of the decisions in a single view that combines the decisions of different algorithms and models should be accomplished to allow users to assess the influence of each decision. For example, some decisions could overfit, and others could contain a mixture of instances falling in different classes. This last phenomenon increases their impurity. Users should be able to interact and explore this decision space.

G4: Comparison of decision rules based on local feature ranking. The global features described in G2 might not be similarly important for specific decisions; hence, local feature ranking via contrastive analysis [Zou2013Contrastive] could shed some light upon this task. Moreover, the interpretation of rules extracted from the space of solutions (see G3) could be achieved if users are capable of investigating the values of both training and testing instances.

G5: Identification of the different types of failure cases and confrontation via manual decisions. Failure to converge to a certain result due to the disagreement of the ML models should be highlighted to users. For instance, if there is no uniformity in the final decision or the majority voted for the wrong result, then it could be that these instances are outliers, borderline cases, or completely misclassified; being able to explore such cases easily is essential.

4 VisRuler: System Overview and Use Case

To accomplish the aforementioned design goals, we have developed VisRuler. The backend of our VA tool was built using Python, Scikit-learn [Pedregosa2011Scikit], and Flask [Flask]. As for the frontend, we utilize JavaScript, Vue [vuejs], D3 [D3], and Plotly.js [plotly].

The tool consists of five main interactive visualization panels (see the tool overview figure): (a) models overview (G1), (b) global feature ranking (G2), (c) decisions space (G3), (d) manual decisions (G4), and (e) decisions evaluation (G5). Our proposed workflow is a two-party system with the ML expert on the one side and the domain expert on the other (see Figure 1). The above-mentioned panels of our tool support the experts’ collaborative effort, specifically: (i) the ML expert should select powerful and diverse models from the two separate algorithms based on their performance assessed by validation metrics (panel (a)); (ii) during this phase, the ML expert should choose which features are important for the active models compared to all models (see panel (b)); (iii) in the next exploration phase, both experts should examine which decisions explain the data set globally and decide upon impactful decisions for a specific test instance (cf. panel (c)); (iv) in this same phase, the domain expert should interpret the manual decisions selected in order to gain insights about the models’ decisions, either globally or locally, for a particular test instance (panel (d)); and (v) in the final phase, the domain expert can evaluate the agreement and extract suitable manual decisions while the ML expert should search for new models if the search did not reach a satisfactory level according to the domain expert (panel (e)). Overall, this is an iterative process with the final goal of obtaining insightful decisions that are interpretable for all parties. Details about the different views within the panels can be found below.

The workflow of VisRuler is model-agnostic as long as rules can be extracted from the deployed ML algorithms. Currently, the implementation uses two popular EL methods: (1) RF and (2) AB (cf. green and blue colors in Figure 1, respectively). This choice was made intentionally because bagging methods work differently than boosting, as explained in Section 1. Furthermore, each data set is split in a stratified fashion (i.e., keeping the class balance in the training/testing split) into 90% training samples and 10% test samples. We also validate our results with 3-fold cross-validation on the training set, and we scan the hyperparameter space for 10 iterations using random search [Bergstra2012Random] in each algorithm separately. The common hyperparameters for both ML algorithms we experimented with (and their intervals) are: number of trees/estimators (2–20), maximum depth of a tree (10–25), and minimum samples in each leaf of a tree (1–10). An extra hyperparameter of RF is the maximum number of features to consider when looking for the best split (()–()). AB has the learning rate (0.1–0.4).
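As a rough illustration of this setup, the following sketch uses Scikit-learn's stratified split and randomized search with the intervals listed above. It is not the tool's actual code: the synthetic data, the random seeds, and the accuracy scoring are assumptions, the RF maximum-features interval is omitted because it is not recoverable here, and the AB search covers only the number of estimators and the learning rate.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): 156 instances, 6 features, 3 classes.
X, y = make_classification(n_samples=156, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Stratified 90%/10% split, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

# Search intervals from the text; scipy's randint excludes the upper bound.
rf_space = {
    "n_estimators": randint(2, 21),      # 2-20 trees
    "max_depth": randint(10, 26),        # 10-25
    "min_samples_leaf": randint(1, 11),  # 1-10
}
ab_space = {
    "n_estimators": randint(2, 21),      # 2-20 estimators
    "learning_rate": uniform(0.1, 0.3),  # 0.1-0.4
}

def random_search(estimator, space):
    """Run 10 random-search iterations with 3-fold cross-validation."""
    search = RandomizedSearchCV(estimator, space, n_iter=10, cv=3,
                                scoring="accuracy", random_state=0)
    return search.fit(X_train, y_train)

rf_search = random_search(RandomForestClassifier(random_state=0), rf_space)
ab_search = random_search(AdaBoostClassifier(random_state=0), ab_space)

print(rf_search.best_params_)
print(ab_search.best_params_)
```

Each of the 10 sampled configurations per algorithm corresponds to one candidate model, which would match the 10 RF and 10 AB models shown in the models overview.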

In the following subsections, we explain VisRuler by describing a use case with the World Happiness Report 2019 [Helliwell2019World] data set obtained from the Kaggle repository [Kaggle2019]. This data set contains 156 countries (i.e., instances) ranked according to an index representing how happy the citizens of each country are. The six other variables that could be considered as features are: (1) GDP per capita, (2) social support, (3) healthy life expectancy, (4) freedom to make life choices, (5) generosity, and (6) corruption perception. Because this data set does not contain any categorical class labels, we follow the same approach as in Neto and Paulovich [Neto2021Multivariate] to discretize the happiness score into three different bins. Hence, we are converting this regression problem into a multi-class classification problem [Salman2012Regression]. As in their case, the original variable Score becomes the target variable that our ML models should predict. In detail, the HS-Level-3 class contains the 42 countries with the highest happiness scores (HS), HS-Level-2 groups the 79 countries in the middle range, and HS-Level-1 encloses the 35 countries with the lowest scores.
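The discretization step can be sketched in a few lines of Python. The bin edges and the example scores below are hypothetical placeholders, since the actual thresholds are not given here; only the three class names come from the text.

```python
import numpy as np

# Hypothetical happiness scores and bin edges (assumptions for illustration).
scores = np.array([7.8, 6.4, 5.2, 4.1, 3.0])
edges = [4.5, 6.0]  # placeholder thresholds, not the values used in the paper

# Class names ordered from the lowest to the highest score bin.
labels = np.array(["HS-Level-1", "HS-Level-2", "HS-Level-3"])

# np.digitize returns 0 below the first edge, 1 between the edges, 2 above.
classes = labels[np.digitize(scores, edges)]
print(classes.tolist())
```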

Figure 1: The VisRuler workflow allows the ML expert to select performant and diverse models, choose important features, investigate hyperparameters, and retrain models. The domain expert can explore robust decisions, compare them to global standards, identify local decisions for a specific test instance, and extract them.
Figure 2: Exploration of ML models with VisRuler. View (a) presents the deactivation of all models except for RF8, RF10, and AB10, after careful consideration of their performance based on the multiple metrics displayed in the visualizations. If we look at (b), Generosity is the least important feature for the three active ML models, and particularly, its importance decreased when we deactivated most of the available ML models (see brown color). (c) indicates that after the retraining with 5 out of the 6 original features, the new AB8 is better than the subsequent models because their recall declines; AB8, RF9, and RF10 remain the only active models after this step. In the box plot in view (d), H life exp becomes more important than GDP per cap. Thus, these features swapped places compared to view (b).
Figure 3: Examining several pure global decisions from the active AB model. In (a), we select step-by-step three clusters of 12 identical decisions each. Note that this screenshot is composed of the Decisions Space (DS) view and the settings for the same view plus the settings for the Manual Decisions (MD) view. The decisions in C1 classify training instances only for the HS-Level-3 class (as depicted in (b)). Similarly, C2 contains decisions for HS-Level-2 (visible in (c)), while C3 contains decisions for the remaining class, as shown in (d). The 7th test instance, which is currently under investigation, cannot be classified by those prior decisions. However, it most likely belongs in the medium- or the high-level class.
Figure 4: An outlier case exploration, the final prediction, and the training of another batch of RF and AB models. (a) presents the anchoring of a cluster of 8 HS-Level-2 decisions to compare the overlapping rules against 3 HS-Level-3 decisions. In (b), after checking the common regions of agreement for the two clusters, we conclude that Perc of cor and H life exp are relatively low for the 15th test instance to belong in the HS-Level-3 class. However, the values for the remaining features are arguably rather high. In (c), we observe that all models voted for the average class, while only the 3 selected manual decisions support this case being categorized as an HS-Level-3 country. (d) showcases a potential search for new models by setting constraints on the hyperparameters according to the knowledge acquired from the initial training.

4.1 Models Overview

The exploration starts with an overview of how 10 RF and 10 AB models performed based on three validation metrics: accuracy, precision, and recall. The models are initially sorted according to the overall score, which is the average of the three metrics. Green is used for the RF algorithm, while blue is for AB. All visual representations share the same x-axis: the identification (ID) number of each model. The line chart in panel (a) of the tool overview figure presents the worst to best models from left to right. The y-axis denotes the score for each metric as a percentage, with distinct symbols used for the different metrics. The Sankey diagram in panel (a) visually maps a confusion matrix of only false-positive and false-negative values for each model, divided into two groups reflecting the two algorithms. It presents the confusion compared to all individual classes, as illustrated in both panel (a) of the tool overview figure and Figure 2(a). The height of the lines indicates the increase or decrease in confusion from one model to the next, so the smaller the height of a line, the better a model’s prediction compared to the predecessor or successor. The same effect applies to each node that absorbs the lines. The bar charts in panel (a) showcase the two main architectural components of the bagged and boosted decision trees, which are the number of trees/estimators hyperparameter and the number of decisions generated from these trees for every model, mapped to the y-axes, respectively. These visualizations allow users to check the structure of the individual models in a juxtaposed manner since the number of decisions is related to the number of trees and the maximum allowed depth of each tree (i.e., max_depth hyperparameter). Finally, the state shown in panel (a) designates which models are currently active (green or blue, respectively). In order to enable the comparison of the currently active models against all models, each icon for an active model contains a brown-colored slider thumb (panel (a), including the legend in the top-right corner).
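The per-model quantities described in this subsection could be computed along the following lines. This is a minimal sketch under stated assumptions: the macro averaging of precision and recall, the use of the Iris data, and the helper name summarize_model are illustrative choices, and counting one decision per tree leaf is our reading of "decision paths".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

def summarize_model(model, X, y):
    """Collect validation metrics, per-class confusion, and model structure."""
    pred = model.predict(X)
    accuracy = accuracy_score(y, pred)
    # Macro averaging over classes is an assumption, not stated in the text.
    precision = precision_score(y, pred, average="macro", zero_division=0)
    recall = recall_score(y, pred, average="macro", zero_division=0)
    cm = confusion_matrix(y, pred)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "overall": (accuracy + precision + recall) / 3.0,
        # Off-diagonal column sums are false positives per class,
        # off-diagonal row sums are false negatives per class.
        "false_positives": cm.sum(axis=0) - np.diag(cm),
        "false_negatives": cm.sum(axis=1) - np.diag(cm),
        "n_trees": len(model.estimators_),
        # One decision (root-to-leaf path) per leaf of every tree.
        "n_decisions": sum(t.get_n_leaves() for t in model.estimators_),
    }

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=5, max_depth=10,
                            random_state=0).fit(X_tr, y_tr)
summary = summarize_model(rf, X_te, y_te)
print(summary)
```

Sorting the models by the "overall" entry would reproduce the worst-to-best ordering of the line chart, and the two structural counts correspond to the two bar charts.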

In our use case, we observe that models with ID number 8 and above slightly outperform the rest; notably, recall in AB7 is much lower than AB8 and beyond (cf. Figure 2(a), line chart). While RF models perform consistently better than AB models, as shown in both the line chart and the Sankey diagram of Figure 2(a), there is an improvement in the score of AB10. Therefore, we decide to keep only this model. Furthermore, since RF8 is more reliable in training instances for the HS-Level-2 class due to false-positives being lower than the equivalent for RF9 and RF10 (Figure 2(a), Sankey diagram), we keep this model and RF10, i.e., the top-performing model of the RF algorithm. In consequence, RF8, RF10, and AB10 are active models after selecting the corresponding states.

4.2 Global Feature Ranking

The box plots, which aggregate per-algorithm importance (see panel (b) of the tool overview figure), provide a holistic view of the contribution of each feature across the models. Each pair of boxes is related to a unique feature, summarizing the active models’ normalized importance per feature (from 0 to 1, i.e., worst to best). The box plots are sorted according to the average values of all active models, which is visible as a number in teal. The difference from the state with all models active is shown with arrows facing up for an increase or down for a decrease in per-feature importance.
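A sketch of how such a ranking could be derived with Scikit-learn follows. The min-max normalization, the synthetic data, and the model names are assumptions; the text only states that importances are normalized from 0 to 1 and averaged over the active models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

def normalized_importance(model):
    """Min-max scale a model's feature importances to the range [0, 1]."""
    imp = np.asarray(model.feature_importances_, dtype=float)
    span = imp.max() - imp.min()
    return (imp - imp.min()) / span if span > 0 else np.zeros_like(imp)

# Synthetic stand-in data (assumption): 156 instances, 6 features, 3 classes.
X, y = make_classification(n_samples=156, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Hypothetical model pool; the names are for illustration only.
models = {
    "RF1": RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y),
    "RF2": RandomForestClassifier(n_estimators=15, random_state=2).fit(X, y),
    "AB1": AdaBoostClassifier(n_estimators=10, random_state=1).fit(X, y),
}
per_model = {name: normalized_importance(m) for name, m in models.items()}

active = ["RF2", "AB1"]  # models the user kept active
mean_all = np.mean(list(per_model.values()), axis=0)
mean_active = np.mean([per_model[name] for name in active], axis=0)

ranking = np.argsort(-mean_active)  # feature indices, most important first
change = mean_active - mean_all     # > 0: arrow up, < 0: arrow down
print(ranking, np.round(change, 3))
```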

At this point, we want to investigate which features of the training set impacted the predictions more (see Figure 2). Interestingly, GDP per cap, H life exp, and Social sup are the top three features in the general ranking, as in [Neto2021Multivariate]. A surprising outcome is that, although two of the features mentioned above are still the most important for the selected RF models (all except Social sup), this is not true for the AB model. As seen in Figure 2(b), Social sup, Perc of cor, and GDP per cap are vital features for the AB algorithm in general. This pattern supports our hypothesis that different algorithms might take into account alternative features and should be combined to provide a holistic view. In contrast, Generosity is unimportant for all models, and even more so for the active models, for which its importance decreases. Thus, we choose to remove this feature and retrain without it (cf. Figure 2(b)). For the RF algorithm (green), we pick the most performant models based on the overall score (Figure 2(c), rightmost models). However, AB8 is better overall than the subsequent AB models due to its stable and high recall value (Figure 2(c), line chart). In a one-to-one comparison between RF9 and AB9 with the bar charts, we recognize that while they have the same number of estimators (i.e., 17 trees), the two models produce 555 and 238 decisions, respectively. In this case, bagged decision trees allow a higher maximum depth than the equivalent boosted decision trees. After the selection of the new models, the most important features collectively are H life exp followed by GDP per cap, as illustrated in Figure 2(d); the opposite was valid in Figure 2(b). The new AB model now ranks the same features as most important as the RF models do. After this phase is over, AB8, RF9, and RF10 are the remaining three active models.

4.3 Decisions Space

The projection-based view in panel (c) of the tool overview figure is produced by using UMAP [McInnes2018UMAP] with a variable n_neighbors hyperparameter and a fixed min_dist. In the visual embedding, decisions are clustered based on their similarity according to the ranges they comprise for each feature, as in [Zhao2019iForest]. To determine the optimal number of clusters to be visualized, DBSCAN [Ester1996A] is used to compute an estimated number of core clusters from the derived decisions for a data set, which is then used to tune n_neighbors, with a minimum of 2 and a maximum of 100 neighbors (the aim is to have the same magnitude in both). The green color in the center of a point indicates that a decision is from RF, while blue is for AB. The outline color exposes the training instances’ class based on a decision’s prediction. The size maps the number of training instances that are classified by a specific decision, and the opacity encodes the impurity of each decision. Low impurity (with only a few training instances from other classes) makes the points more opaque. The positioning of the points can be useful to observe if both RF and AB models produced similar rules, offering a comparison between algorithm decisions. The histogram in panel (c) shows the number of decisions (y-axis) and the distribution of training instances in these paths (x-axis), and can also be used to filter the visible decisions, hiding overfitted rules that contain only a few instances or general rules that might not apply in problematic cases.
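The tuning of n_neighbors might look roughly like the sketch below. It is only one possible reading: the DBSCAN parameters, the synthetic decision vectors, the helper name estimate_n_neighbors, and the direct use of the clamped cluster count as n_neighbors are all assumptions, since the text does not spell out the exact mapping.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def estimate_n_neighbors(decision_vectors, eps=1.5, min_samples=5,
                         lower=2, upper=100):
    """Estimate core clusters with DBSCAN and clamp the count to [2, 100]."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        decision_vectors)
    n_clusters = len(set(labels) - {-1})  # the label -1 marks noise points
    return int(np.clip(n_clusters, lower, upper)), n_clusters

# Synthetic stand-in for decisions (assumption): each row concatenates the
# lower and upper bound of 5 features, giving 10 values per decision.
decision_vectors, _ = make_blobs(n_samples=200, n_features=10, centers=4,
                                 cluster_std=0.3, random_state=0)

n_neighbors, n_clusters = estimate_n_neighbors(decision_vectors)
print(n_clusters, n_neighbors)

# The result would then be passed as the n_neighbors argument of umap.UMAP
# (from the umap-learn package, not imported here) to compute the embedding.
```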

Multiple interactions are possible in this view. The rounding slider (set to 15) allows users to round all decisions’ range values to the desired number of decimal places. The comparison mode (active in panel (c) of the tool overview figure) enables users to anchor groups of points and compare the selection against any other cluster. The two alternative choices are to present either the overlap or the difference between the handpicked groups; the Detach button cancels this mode. Density views assist users in observing the distribution of RF against AB decisions in the projection, which is helpful if large numbers of decisions are visualized (see Supplemental Figure S1). The Limit Decisions due to Test Instance checkbox alters the layout and changes the exploration from global decisions to local decisions for a particular case. Finally, a limit can be set for the acceptable impurity that is visible. If a decision is more impure than the currently chosen value, then it becomes almost transparent. As this view is tightly connected with the visualization of the following view, we proceed directly to Section 4.4.
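To make the notion of a decision concrete, the following sketch extracts one rule per leaf from a single Scikit-learn tree, with per-feature ranges on normalized features, rounding, and an impurity filter. It is an illustration rather than the tool's implementation: the Gini impurity stored by Scikit-learn stands in for the tool's impurity measure, and the helper name extract_decisions and the Iris data are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

def extract_decisions(tree, n_features, decimals=2):
    """Return one decision (per-feature ranges plus statistics) per leaf."""
    t = tree.tree_
    decisions = []

    def walk(node, lower, upper):
        if t.children_left[node] == -1:  # -1 marks a leaf in scikit-learn
            decisions.append({
                "lower": np.round(lower, decimals),
                "upper": np.round(upper, decimals),
                "n_samples": int(t.n_node_samples[node]),
                "predicted_class": int(np.argmax(t.value[node][0])),
                "impurity": float(t.impurity[node]),  # Gini (assumption)
            })
            return
        feature, threshold = t.feature[node], t.threshold[node]
        # Left branch: feature <= threshold tightens the upper bound.
        left_upper = upper.copy()
        left_upper[feature] = min(upper[feature], threshold)
        walk(t.children_left[node], lower, left_upper)
        # Right branch: feature > threshold tightens the lower bound.
        right_lower = lower.copy()
        right_lower[feature] = max(lower[feature], threshold)
        walk(t.children_right[node], right_lower, upper)

    # Features are scaled to [0, 1], so every range starts as [0, 1].
    walk(0, np.zeros(n_features), np.ones(n_features))
    return decisions

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

decisions = extract_decisions(tree, n_features=X.shape[1], decimals=2)
pure = [d for d in decisions if d["impurity"] <= 0.0]  # impurity limit of 0
print(len(decisions), len(pure))
```

Applying the same function to every tree in an ensemble's estimators_ list would yield the full set of decisions, whose concatenated lower and upper bounds could serve as input vectors for the projection.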

4.4 Manual Decisions

The vertical Parallel Coordinates Plot (PCP)-like view in [(d) illustrates the range values per feature for each selected decision (in this case, the comparison mode is active). The polylines represent the training instances and are color-encoded based on the ground truth (GT) class. There are two options here: either filter the instances and show only those that belong to the selected rules (see [(d)) or present all training instances at once (see Figure 3(b)–(d)). For example, in Figure 3(b), we see 12 identical rules that classify the training instances in the HS-Level-3 class (the red colored horizontal lines). The thick black polyline is the currently explorable test instance; users can compare it to the training instances that the models were trained on. All ranges for the features are normalized from 0.0 to 1.0. Scrolling is implemented for cases where many decisions must be shown or the number of features is large. The order of the features is initially the global one, as described in Section 4.2. When a group of points is selected using the lasso tool in the Decisions Space, a contrastive analysis [Zou2013Contrastive] is used to rank the features and help the user find unique features that explain a cluster’s separation from the rest of the points. The computation works as follows: (1) break each feature into two disjoint distributions: the values inside the selected group vs. all the rest of the points; (2) discretize the two distributions of each feature into bins based on the Local Feature Ranking - Bins value set by the user (default is 10); (3) compute the cross-entropy [Mannor2005The] between the two distributions of each feature: higher values of cross-entropy suggest more unique features (i.e., the within-selection distribution is very different from the rest), while lower values suggest more common, shared features; and (4) rank the features based on step 3, with the more unique features nearer the top.
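The four steps of the contrastive ranking can be sketched as below. This is a minimal sketch under stated assumptions: each point contributes one normalized value per feature (for example, a range midpoint), the bins are equal-width over [0, 1], a small smoothing constant avoids taking the logarithm of zero, and the cross-entropy is taken from the selection to the rest. None of these details are confirmed by the description above, and the data values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def histogram(values: Sequence[float], bins: int, eps: float = 1e-9) -> List[float]:
    """Equal-width histogram over [0, 1], smoothed and normalized to sum to 1."""
    counts = [0.0] * bins
    for v in values:
        index = min(int(v * bins), bins - 1)  # v == 1.0 falls into the last bin
        counts[index] += 1.0
    total = sum(counts) + eps * bins
    return [(c + eps) / total for c in counts]


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """H(p, q) = -sum_i p_i * log(q_i); large when q gives little mass where p has much."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def rank_features(selected: Dict[str, List[float]],
                  rest: Dict[str, List[float]],
                  bins: int = 10) -> List[str]:
    """Steps 1-4: split, discretize, score by cross-entropy, sort most unique first."""
    scores = {
        name: cross_entropy(histogram(selected[name], bins), histogram(rest[name], bins))
        for name in selected
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical normalized values for a lasso selection and for all other points.
selected = {"Len_emp": [0.8, 0.9, 0.85], "Age": [0.25, 0.55, 0.75]}
rest = {"Len_emp": [0.1, 0.2, 0.15, 0.3], "Age": [0.25, 0.55, 0.75, 0.4]}
ranking = rank_features(selected, rest)
```

In this toy input, the selected points occupy Len_emp bins that the rest never reach, so Len_emp receives the larger cross-entropy and is ranked first, whereas Age is shared by both groups and ranks lower.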

To investigate the global decisions based on the AB8 model, we set the impurity to 0, disable limiting decisions based on the current test instance, and hide the RF models (cf. Figure 3(a)). We notice from the size of the decisions that if we analyze three core clusters (C1–C3), we can get a better understanding of global decisions (see Figure 3(a)). In Figure 3(b), all 140 training instances (spread across the classes) are observable together with the 7th test instance, which is currently under investigation. From Figure 3(b), we see that Social sup and GDP per cap should be very high for test instances to belong to this class. In contrast, for test instances to be in the HS-Level-2 class, they need to have a low-to-average Social sup, and average GDP per cap and H life exp (Figure 3(c)). Low values in the features (1) Freedom, (2) GDP per cap, and (3) H life exp are common for countries with a low happiness score (see Figure 3(d)), as also identified by [Neto2021Multivariate]. Regarding Saudi Arabia (the 7th test instance), it does not appear to belong to any of those decisions, but it is far away from the values reported for the HS-Level-1 class. Its GDP per cap is too high for it to belong in the average class, although its Social sup is on the lower side. Nevertheless, GDP per cap is one of the two most important features according to the analysis in Section 4.2. Our conclusion matches the fact that it was ranked in 28th place out of the 156 countries, thus belonging to the list of 42 countries classified as HS-Level-3.

4.5 Decisions Evaluation

The panel in [(e) contains interactive views that help users find outliers, borderline cases, and misclassified cases in the test set. The first main view allows users to extract the manual decisions (MD) selected in the previous phase (see Section 4.4). It also guides users in concentrating on cases where the majority of the RF and AB models disagreed when compared to the GT, or for models that did not vote unanimously. Furthermore, it is possible to go through all test instances one by one. The class agreement between RF and AB models, MD, and the GT is demonstrated via a horizontal stacked bar chart. The colors encode the different classes, and the length of each bar is the number of decisions for (1) MD, (2) RF models, (3) AB models, and (4) the GT (the latter always fills the entire bar). The second main view targets users that want to train new models based on the Ov. Score (%) of each previously-trained model. The two separate standard PCPs present the active RF models in green and the active AB models in blue, respectively. The brown color is used for the inactive models in both visualizations.
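The guidance toward conflicting cases can be approximated with a simple vote count per test instance. The sketch below is an assumption about how such flags might be computed, with one predicted class per active model and hypothetical function and key names; it is not the tool's actual implementation.

```python
from collections import Counter
from typing import Dict, List


def flag_instance(rf_votes: List[str], ab_votes: List[str], ground_truth: str) -> Dict[str, object]:
    """Summarize the class agreement of the active RF and AB models for one test instance."""
    counts = Counter(list(rf_votes) + list(ab_votes))
    # On a tie, most_common keeps the class that was encountered first.
    majority, _ = counts.most_common(1)[0]
    return {
        "majority": majority,
        "unanimous": len(counts) == 1,
        "majority_wrong": majority != ground_truth,
    }


# All active models agree on the average class, but the GT is HS-Level-3.
unanimous_case = flag_instance(["HS-Level-2", "HS-Level-2"], ["HS-Level-2"], "HS-Level-3")
# The models are split, so the instance is worth a closer look.
split_case = flag_instance(["HS-Level-3", "HS-Level-3"], ["HS-Level-2"], "HS-Level-3")
```

Instances with `majority_wrong` set, or without a unanimous vote, correspond to the cases that this panel lets users step through.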

Checking the cases where the majority of the models disagree with the GT, we stop at the 15th test instance. Figure 4(a) shows the decisions applicable for this unusual case. We use the comparison mode to select a pure cluster on the left to juxtapose it with decisions classifying countries as HS-Level-3 on the right. Anchoring these clusters of points shows us the overlap of value ranges for the different features, as depicted in Figure 4(b). 28 out of the 30 training instances are similar to this test instance and belong to the HS-Level-2 class. The ranking of the features indicates that Perc of cor and H life exp are two unique features for the selected points, with low values for the former and average values for the latter, as in [Neto2021Multivariate]. Furthermore, for the first four features, the overlap is narrow between the two selected clusters, indicating that this instance could be considered an outlier. Indeed, Figure 4(c) shows that 8 out of the 11 decisions consider this instance as HS-Level-2. All active models are wrongly predicting Trinidad and Tobago (i.e., the 15th test instance) as an average HS country. Interestingly, the 3 MD of the RF models classified this country as HS-Level-3.

From the analyses in the previous subsections and the overall score of the RF and AB models, we observe that the most performant models for RF consider only 2 features when splitting the nodes (i.e., max_features hyperparameter). The PCPs in Figure 4(d) enable us to scan the internal regions of the hyperparameters’ solution space for RF. As for AB, the learning_rate should be as low as possible for this specific data set, as seen in Figure 4(d). Also, by searching for models with high values for min_samples_leaf, AB models are created with complex decision trees compared to simple decision stumps, which seems to be an appropriate limitation of the hyperparameter space that could lead to better models. After all these constraints, we move the Search for New Models slider from 0 to 10 in Figure 4(d) to request 10 additional models for each algorithm with the hope of discovering more powerful ones.

5 Usage Scenario

Figure 5: The exploration of clusters of decision paths from both ML algorithms. View (a) presents the selection of three clusters of global decisions that classify multiple training instances, thus avoiding unimportant paths that might overfit. (b) provides an in-depth analysis of the decision rules affected by C1. In (c), Len_emp emerges as a unique feature that characterizes C2 with values from approximately 0.4 to 1.0. Finally, in (d), high values in P_st_cred and Ins_perc flip the prediction for the applicant to reject, visible via the exploration of C3.

In this section, we describe a hypothetical usage scenario with a collaboration of a model developer (Amy, the ML expert) and a bank manager (Joe, the domain expert) who handles granting loans to customers. Joe wants to use VisRuler to improve the evaluation process of loan requests, so he asks Amy to use VisRuler to train ML models based on a data set collected over years of accepting or rejecting loans in the bank. The data set includes 1,000 instances/customers and 9 features/customer information, with 300 rejected (purple) and 700 accepted (orange) applications. This data set is, in reality, a pre-processed version [Zhao2019iForest, Neto2021Explainable] of German Credit Data from the UCI ML repository [Dua2017].

Exploration and Selection of Algorithms and Models. Following the workflow in Section 4, Amy loads the data set and checks the score of each model based on the three validation metrics ([(a)). For the AB algorithm, in blue, all models have a relatively low value for the recall metric, except for AB8. Also, AB7 performs very well for the Accepted class (orange), since the false-negative (FN) line reduces in height compared to all other models. Therefore, she decides to keep only AB7 and AB8. By looking at the Sankey diagram in [(a), Amy infers that RF4 and RF5 are the two models with low confusion, due to only 135 false-positive (FP) instances. She picks RF5 because it is the subsequent model from RF4, which means that the overall score is slightly higher. The top RF models on the right-hand side also caught her attention, with RF9 and RF10 being the best options. She thinks that either of them could do the job, as they appear redundant due to similar confusion and values in both the Sankey diagram and the line chart (cf. [(a)). The bar charts below—which highlight the difference in the architectures of these RF models—help her to choose: with only 7 decision trees and 589 decision paths (compared to 18 and 1,483), RF9 is simpler. She concludes that RF9’s simplicity will make Joe’s exploration of decisions more manageable at a later phase. Consequently, she deactivates RF10 and continues the feature contribution analysis with RF5, RF9, AB7, and AB8 models.

Examining the Global Contribution of Features. After this new selection of models, Amy observes in [(b) that most features (except for the last two) are more important now than in the initial state. Ins_perc and Val_sa_st importances drop only by 0.01, implying these features are stable. She suggests that Joe keep all features for now and explore the differences through the decision rules later on. Another interesting insight is that A_bal is the most important feature for the RF models, while the AB models prefer D_cred (see [(b)). This could indicate that mixing models’ decisions from different algorithms is beneficial.

Explanations through Global Decision Rules. Joe starts his exploration by examining the global decision rules that can help him make accurate decisions for specific cases in the future. He focuses on the 12th test instance, which is a customer application reviewed by a colleague, Silvia (cf. usage scenario by Neto and Paulovich [Neto2021Explainable]). First, he unchecks limiting the decisions due to the test instance, as illustrated in Figure 5(a). At this point, Amy identifies several decisions that classify fewer than 20 customers; she thinks: “these are not so generic after all”. Indeed, the larger the number of instances classified by one rule, the more generic and important it is (if the impurity is low). Consequently, they decide to increase the lower boundary of decisions, filtering out 1,928 decisions (see Figure 5(a), bar chart). After the update, Joe focuses on the UMAP [McInnes2018UMAP] projection. He observes multiple groups of points that could be worthy of further investigation. He selects a couple of samples from different areas, e.g., C1 with 3 RF and 18 AB decisions. Another cluster with 7 decisions is C2, which solely predicts accepted loan applications. On the contrary, C3 contains 2 pure decisions (due to high opacity) that produce rules which reject loans. Joe increases the discretization of local feature ranking from 10 to 15 bins to raise the sensitivity to differences between decision rule ranges, and he filters the instances due to the decisions to observe clearer trends. From Figure 5(b), Joe recognizes that the C1 decisions are all identical, having the same ranges for every feature. Also, he understands that a low credited amount (Cred_am) and a short duration of credit (D_cred) are essential factors for accepting a loan application. Account balance is also vital because all loans are accepted when there is no account (A_bal being 0). Figure 5(c) reveals another intriguing pattern, that is, the length of current employment should be average to extremely high (from approximately 0.4 or 0.6 and above) for applications to get accepted. In contrast, Figure 5(d) shows that if payment status of previous credit (P_st_cred) and instalment per cent (Ins_perc) are relatively high, the applications were rejected. The 12th customer has an account without any balance, and the D_cred is relatively high, which flips the prediction toward rejection. Luckily, Silvia also provided an adequate justification to the customer [Neto2021Explainable].
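The filtering of decisions by support and purity can be sketched as follows. The impurity measure is not specified in this section, so the Gini impurity used here is an assumption, as are the dictionary layout, the function names, and the example counts.

```python
from typing import Dict, List


def gini_impurity(class_counts: Dict[str, int]) -> float:
    """Gini impurity of the training instances reaching a decision (assumed measure).

    Assumes the decision covers at least one training instance.
    """
    total = sum(class_counts.values())
    return 1.0 - sum((count / total) ** 2 for count in class_counts.values())


def keep_generic(decisions: List[dict], min_instances: int, max_impurity: float) -> List[dict]:
    """Keep decisions that cover enough training instances and are sufficiently pure."""
    return [
        d for d in decisions
        if sum(d["counts"].values()) >= min_instances
        and gini_impurity(d["counts"]) <= max_impurity
    ]


# Hypothetical decisions with the class counts of the training instances they classify.
decisions = [
    {"id": "d1", "counts": {"accepted": 40, "rejected": 2}},  # generic and fairly pure
    {"id": "d2", "counts": {"accepted": 5}},                   # pure but too specific
    {"id": "d3", "counts": {"accepted": 12, "rejected": 8}},  # generic but impure
]
kept = keep_generic(decisions, min_instances=20, max_impurity=0.3)
```

Raising `min_instances` plays the role of the lower boundary that Amy and Joe increase, while `max_impurity` mirrors the impurity limit of the projection view.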

Extracting Manual Decisions through Local Investigations. At this point, Joe knows and understands the main decision rules, but a new customer arrives. Focusing on the decisions for this case (i.e., the 90th test instance), he sets impurity to less than 0.3 (cf. [(c), slider) to make impure decisions more transparent. Two fairly pure decisions from RF5 (visible due to hovering) and RF9 contradict each other. Joe uses the comparison mode, anchors one of the two decisions, and selects the other with the lasso tool. The comparison in [(d) indicates that 8 similar customers’ applications were rejected while 12 were accepted. The small overlap in Cred_am, D_cred, and Age suggests that this is a borderline case. Cred_am seems a bit arbitrary for the training data since only a small number of applications in-between accepted applications were rejected, see [(d), feature on top. However, a clear insight is that if D_cred were lower, the application should have been accepted, while the opposite effect is true if the duration of credit increases. Unexpectedly, RF models vote for accepting this loan application while AB models reject it (cf. [(e), top view). Besides that, the manual decisions are also in-between the two classes, which further supports Joe’s assumption that this is a borderline case. As AB models propose rejection and RF9 produces a decision for rejecting this application, he follows these recommendations. Nonetheless, Joe asks Amy to search and train new performant ML models (see next paragraph).

Tuning the Search for Bagged and Boosted Decision Trees. Amy sees two possibilities of improvement for the RF in [(e), bottom view. One is to limit the max_features to 7 and 8 because they produced the best models so far. The second strategy is to pick 3 and 4 for the same hyperparameter to explore an entirely new space of currently unexplored models. Basically, she believes it is better to try both strategies in two separate runs. As for the AB, she reasons that selecting 0.1 and 0.2 for the learning_rate is a wise choice. Although it may take more time to retrain the AB models, they probably will be more powerful than with the other setting due to historical data. She performs the above actions, and finally, another cycle of exploration is unfolded for both experts.
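Amy's two strategies amount to sampling new model configurations from a restricted hyperparameter space. The sketch below only draws candidate configurations; the sampling scheme, the function name, the seed, and the number of requested models are assumptions, and training the models is left out.

```python
import random
from typing import Dict, List, Sequence


def sample_configs(space: Dict[str, Sequence], count: int, seed: int = 0) -> List[dict]:
    """Draw `count` candidate configurations uniformly from the restricted space."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    return [{name: rng.choice(list(values)) for name, values in space.items()}
            for _ in range(count)]


# Two separate runs for RF, as Amy proposes, and one restricted space for AB.
rf_exploit = sample_configs({"max_features": [7, 8]}, count=10)
rf_explore = sample_configs({"max_features": [3, 4]}, count=10)
ab_candidates = sample_configs({"learning_rate": [0.1, 0.2]}, count=10)
```

Keeping the exploit and explore runs separate makes it easy to compare whether the best new models come from the known good region or from the previously unexplored one.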

6 User Study

We conducted a user study to evaluate our tool’s effectiveness in supporting decision-making based on many alternative decision paths. As in prior works [Ming2019RuleMatrix, Neto2021Explainable], we created five questions (Qs) that cover VisRuler’s different views, focusing on appraising the goals described in Section 3 with the use case outlined in Section 4 as the GT (see Supplemental Table S2).

Demographics and Instructions. 7 male and 5 female volunteers aged 23 to 49 (mean: 33) participated in our study, all with at least an MSc degree (2 of them with a PhD). None of them knew the data set used, and no colorblindness issues were reported. 4 of the participants were highly knowledgeable in visualization and 7 in ML, while the rest had limited knowledge regarding all aspects and 4 of them had never worked with any EL method. The initial step of the study was to watch an 18-minute video tutorial about bagging and boosting concepts, VisRuler’s goals, and how to work with our tool to analyze decision paths, using the Iris data set [Fisher1936The]. The participants experimented for five minutes with Iris, and then proceeded to use the data set described in Section 4. They were asked to answer five questions (cf. Supplemental Document S3) and provide qualitative feedback via the ICE-T questionnaire [wall2019aheuristic].

Question-related Results. After the initial setting shown in Figure 2(a), all participants decided to exclude Generosity in Q1, which happened in 2.03 minutes on average. For Q2, 9 participants followed our GT, as described in Figure 2(c). The remaining participants selected AB10 instead of AB8. This action led to 5 test instances in conflict compared to 3 in our analysis (Figure 4(c) presents a single case). This result could be a strong indication that our approach is essential for making such decisions. To respond to Q2 and Q3, participants took 4.04 and 2.58 minutes on average, respectively. The most time-consuming question was Q4 with an average response time of 6.15 minutes (but with very accurate results, see Figure 3(b)). The average time taken for Q5 was 6.07 minutes, with only one wrong answer (Figure 4(b)).

Qualitative Results. In Table 1, the mean scores of each component of the ICE-T form [wall2019aheuristic] for every participant are displayed along with the two-tailed 95% confidence intervals (CIs) per component. Higher values in green designate good results, as opposed to red. VisRuler has received a few 7.0 scores, and most are 6.0 or above (the lowest score is 4.67). Essence, Insight, and Time received high scores, which means users found our tool competent in portraying decisions, guiding users to come up with fundamental questions, and performing these discoveries quickly. The Confidence was lower, with a mean value of 5.87. However, this value still makes VisRuler a reliable and trustworthy VA tool based on Wall et al. [wall2019aheuristic].
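For readers who want to reproduce this kind of summary, a two-tailed 95% CI of a component mean can be computed as sketched below. The scores are hypothetical placeholders rather than the study's data. The use of Student's t-distribution with 11 degrees of freedom (12 participants) is an assumption about how the intervals were derived.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-tailed 95% critical value of Student's t for 11 degrees of freedom
# (assumption: one score per participant, 12 participants).
T_CRIT_DF11 = 2.201


def mean_ci(scores: Sequence[float], t_crit: float) -> Tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) of the confidence interval of the mean."""
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Hypothetical per-participant scores for one ICE-T component (1 to 7 scale).
scores = [6.0, 6.5, 7.0, 5.5, 6.0, 6.5, 7.0, 6.0, 5.0, 6.5, 6.0, 6.0]
mean, low, high = mean_ci(scores, T_CRIT_DF11)
```

The critical value must be replaced if the number of participants, and hence the degrees of freedom, changes.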

Table 1: Analyzed results from the ICE-T feedback [wall2019aheuristic].

7 Discussion and Conclusions

We presented VisRuler, a VA tool that allows users to explore diverse rules extracted from bagged and boosted decision trees to reach a consensus about a final decision for each individual case. The multiple coordinated views facilitate the selection of diverse and performant models, the characterization of per-feature contribution, the management of multiple decisions, the analysis of global decisions, and case-based reasoning. Finally, we validated the usability and efficacy of VisRuler via a user study.

Limitations. Although VisRuler can visualize thousands of decision paths, as verified by the usage scenario in Section 5, the cluttering of the dimension reduction methods could be deemed an intrinsic difficulty. Also, efficiency might be problematic if numerous models are simultaneously active and produce too many decisions. In such cases, the vertical PCP may be challenging to interpret because it requires users to scroll through a list of decisions that expands with the number of features. Another limitation is the extensive (but unavoidable) use of color, which might hinder our tool from operating with more than a few classes. All these limitations indicate future directions for our work.


This work was partially supported through the ELLIIT environment for strategic research in Sweden.