In 2018, the Fair Isaac Corporation (FICO) proposed a challenge to data science researchers: present an explainable model for the risk of defaulting on a loan. FICO provided a dataset for the challenge, and asked researchers to provide explanations for the global model, as well as local explanations (explanations for a given prediction). They also requested that the global model respect monotonicity constraints (increasing or decreasing) on several of the variables.
In responding to this challenge, we tailored our modeling approach to specific aspects of the dataset: the FICO data are balanced between the two classes, most of the features are real-valued, and, most importantly, each of the 23 features is itself interpretable. Because the features already come with a good representation, the algorithm does not need to construct one. Generally, for data with this property, perhaps with a small amount of feature engineering, most machine learning algorithms tend to have almost the same performance, including algorithms that produce globally interpretable models. Thus, we aimed to create a model that was fully and globally interpretable, rather than to construct a black box.
We call our globally interpretable model a two-layer additive risk model. It was designed to resemble traditional subscale models, where the features are partitioned into meaningful subgroups and the subgroup scores are later combined into a global model. Traditional subscale models are generally interpretable because they are decomposable into meaningful components, and because these models are usually linear with positive coefficients for risk factors. Our model preserves these classical elements (decomposability, linear modeling, positive coefficients for risk factors), but inserts interpretable nonlinearities in several places to make the model more flexible and accurate. In particular, the algorithm transforms the original features into piecewise constant functions that monotonically increase (or decrease) if we constrain them to do so. Combinations of these piecewise constant functions form the subscales in the second layer of the network. The subscales are fed through a sigmoid nonlinearity, which adds more perceptron-like flexibility to the model, but also makes the subscale scores more meaningful as their own “mini-models”: each subscale produces its own probability of defaulting on a loan. The subscales are combined linearly and sent through a sigmoid function to produce the final probability of defaulting on a loan.
It is important to note that the challenge guidelines asked for an explanation of the global model, such as variable importance information, for each class; they did not necessarily ask for a model that is globally interpretable. Perhaps it could benefit FICO to have a global model that is not interpretable (a “secret sauce”)? Even if a globally interpretable model exists, such as the one we found, it is not clear that it would be desirable for FICO to release it. Thus, we are not certain that we responded as much to FICO’s needs as to those of its customers.
The issue raised above, about explaining a black box model versus providing a globally interpretable model, is important. Even with new regulations such as the General Data Protection Regulation’s (GDPR) “right to an explanation” (Regulation, 2016), there could still be little incentive for companies to provide anything more than a modest local explanation, even if a globally interpretable model existed. However, explanations can be problematic for several reasons. First, unless the global model is uniformly equal to the local model, the explanations will sometimes be incorrect, which makes it difficult to trust either the explanations or the global model itself. Even if an explanation gives the same prediction as the global model, it could be inconsistent with the global model’s actual calculations (i.e., low fidelity; Guidotti et al., 2018b), or it might provide reasons that are true for some cases but not others. For instance, a reason of “too many accounts open” may be used to deny someone a loan even if another person with the same number of open accounts was offered a loan by the same global model. Explanations could also be correct but misleading, offering reasons that are true but incomplete – missing key information. All of these problems with explanation-of-black-box methods are reasons that consumers would benefit from globally interpretable models. Again, however, we fully recognize that creating a globally interpretable model is not usually desirable from a business perspective. (Companies such as DivePlane are now grappling with this issue: as of this writing, they do not release code or demonstrations of their models, which are claimed not to be black boxes.)
Attempts to create globally interpretable models for financial applications use mainly standard machine learning approaches (e.g., decision trees and support vector machines are used for bank direct marketing (Moro et al., 2011, 2014)), which cannot accommodate FICO’s monotonicity constraints. Some work finds optimal rules (Chen and Rudin, 2018), but rules are not natural for datasets with many real-valued features, like FICO’s data. Additive models are natural for real-valued features and can easily preserve monotonicity.
Even though our global model is interpretable and thus can be explained on its own, we can also produce optional local “explanations,” which now simply become summaries of general trends in the global model. These are summaries rather than explanations (or summary-explanations) in that they do not aim to reproduce the global model, only to show patterns in its predictions. Our summary-explanation method (called SetCoverExplanation) is a model-agnostic explanation algorithm (Shaposhnik and Rudin, 2018). SetCoverExplanation solves a minimal set cover optimization problem (Feige, 1998, 1996) to generate conjunctive rules that are consistent with all training cases.
To create an explanation of an individual prediction, our interactive display first highlights the factors that contribute most heavily to the final prediction of the global model. Second, we show patterns produced by SetCoverExplanation. Third, we provide case-based explanations. Our case-based reasoning method finds cases that are similar on important features to any current case that the user inputs.
We created an interactive display that shows the full computation of the model from beginning to end, without hiding any nonlinearities or computations from the user. Factors are colored according to their contribution to the global model. The form of our global model lends itself naturally to variable importance analysis, and understanding monotonicity constraints, through the visualization.
The novel elements of the work are (i) the form of the two-layer additive risk model, which lends naturally to sparsity, decomposibility, visualization, case-based reasoning, feature importance, and monotonicity constraints, (ii) the interactive visualization tool for the model, (iii) the use of the SetCoverExplanation algorithm for high-support local conjunctive explanations, and (iv) the application to finance, indicating that black boxes may not be necessary in the case of credit-risk assessment.
2 Two-Layer Additive Risk Model and its Visualization
We work with a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where each $\mathbf{x}_i$ is a vector of features in which the categorical features are binarized. The labels are indicators of defaulting on a loan: $y_i \in \{0, 1\}$. Let $\mathcal{F}$ represent the set of features and $p = |\mathcal{F}|$. We present the structure of our two-layer additive risk model (ARM). While neural networks are generally hard to comprehend even with sparsity regularization applied, our model has carefully designed sparsity and monotonicity constraints that make the calculations easier to comprehend. Moreover, our model did not require quadratic terms, as the model of Lou et al. (2013) did.
First, to ensure monotonicity of the model with respect to any given feature, we used step functions as our initial transformations of the features, and constrained the coefficients of each feature’s step functions to be non-negative. For instance, for a feature $x_j$ whose relationship to risk should be monotonically decreasing, features such as $\mathbb{1}[x_j < \theta_1]$, $\mathbb{1}[x_j < \theta_2]$, and $\mathbb{1}[x_j < \theta_3]$ could be created (we also handle missing values by creating binary indicator features for missingness). Note that all of these features use one-sided intervals. This choice was made because non-negative combinations of these step functions yield monotonic piecewise constant functions. Specifically, by enforcing the constraint that the coefficients $c_1$, $c_2$, and $c_3$ on these indicators must be non-negative, we guarantee a monotonically decreasing relationship between the original continuous feature and the subscale’s predicted probability of default. Once the model is learned, the sum of the one-sided intervals becomes a piecewise constant function. For instance, with $\theta_1 < \theta_2 < \theta_3$,

$$c_1 \mathbb{1}[x_j < \theta_1] + c_2 \mathbb{1}[x_j < \theta_2] + c_3 \mathbb{1}[x_j < \theta_3]$$

can be equivalently written as

$$f(x_j) = \begin{cases} c_1 + c_2 + c_3 & \text{if } x_j < \theta_1,\\ c_2 + c_3 & \text{if } \theta_1 \le x_j < \theta_2,\\ c_3 & \text{if } \theta_2 \le x_j < \theta_3,\\ 0 & \text{if } x_j \ge \theta_3. \end{cases}$$

If the coefficients $c_1$, $c_2$, and $c_3$ are non-negative, the function is nonincreasing. If instead we would like to constrain the relationship to be monotonically increasing, we reverse the above one-sided inequalities to $\mathbb{1}[x_j > \theta_k]$. Of course, if a feature has no desired monotonicity, we drop the non-negativity constraints.
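As a concrete sketch of this encoding (in Python; the threshold values used below are hypothetical, not the fitted model's):

```python
import numpy as np

def binarize_monotone(x, thresholds, decreasing=True):
    """Encode a continuous feature as one-sided interval indicators.

    For a feature whose risk relationship should be monotonically
    decreasing, each indicator is 1[x < theta]; any non-negative
    combination of these indicators is then a non-increasing step
    function of x. For an increasing feature, the inequality reverses.
    """
    x = np.asarray(x, dtype=float)
    if decreasing:
        cols = [(x < t).astype(float) for t in thresholds]
    else:
        cols = [(x > t).astype(float) for t in thresholds]
    return np.stack(cols, axis=1)
```

For instance, with hypothetical thresholds 30, 48, and 69, a feature value of 40 activates the second and third indicators but not the first.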
Using domain knowledge obtained from the data description, we partitioned the features into different sets for the subscales, inducing sparsity in the first layer of the network. Each subset of features is sent to one node, which computes a subscale. Denote the feature subsets as $\mathcal{F}_1, \dots, \mathcal{F}_{10}$, where subset $\mathcal{F}_k$ is sent to subscale $k$. There are between 1 and 4 of the original features combined per subscale, yielding a total of 10 subscales to represent the original 23 features. Each subscale can be interpreted as a miniature model for predicting the probability of failure to repay a loan, using only the features designated for that subscale. The output of subscale $k$ is a probability $p_k$, which is simply a sigmoid transformation of the subscale’s score,

$$p_k = \sigma\Big( \sum_{j \in \mathcal{F}_k} \sum_{t=1}^{T_j} c_{j,t}\, b_{j,t} \Big),$$

where $\sigma(z) = 1/(1+e^{-z})$ represents the sigmoid function, $b_{j,t}$ are the binary features created from feature $j$ with coefficients $c_{j,t}$, and $T_j$ is the number of binary features created for feature $j$ (not counting the special indicator for non-missing values).
Finally, the subscale results are linearly combined and again nonlinearly transformed into a final probability of failure to repay a loan. The contribution of each subscale to the final prediction can be easily observed by its weighted output.
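The full forward computation can be sketched as follows (Python; the subsets, coefficients, and biases are hypothetical placeholders, not the fitted model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def arm_predict(b, subsets, sub_coefs, sub_biases, top_coefs, top_bias):
    """Two-layer additive risk model forward pass (sketch).

    b         : (d,) 0/1 vector of thresholded binary features
    subsets   : list of index arrays, one per subscale
    sub_coefs : per-subscale coefficient vectors (non-negative where
                monotonicity is enforced)
    Returns the final default probability and the subscale probabilities,
    each of which is itself a "mini-model" probability of default.
    """
    b = np.asarray(b, dtype=float)
    sub_probs = np.array([
        sigmoid(b[idx] @ c + b0)
        for idx, c, b0 in zip(subsets, sub_coefs, sub_biases)
    ])
    final = sigmoid(sub_probs @ np.asarray(top_coefs) + top_bias)
    return final, sub_probs
```

The weighted terms `top_coefs * sub_probs` are exactly the per-subscale contributions described above.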
The simplest way to train the coefficients is to treat each subscale as an independent classification model, with the labels $y_i$ as the target variables, using regularization (e.g., $\ell_2$) to prevent overfitting and non-negativity constraints on the coefficients to enforce monotonicity. A slightly more complicated way to train is to optimize a combination of subscale accuracy and global-model accuracy. On our data, these two methods tended to produce almost identical results. An image of the full model is shown in Figure 3, where the colors indicate the final contribution to the combined score; red indicates a higher likelihood of defaulting on the loan. The 23 feature values can be entered on the left, and clicking on any of the 10 subscales (in the second colored layer) reveals a pop-up window with the calculation, as shown in Figure 2 for two subscales. The final combination of features is shown in Figure 3 (right panel). Figure 4 shows that our global model does not lose accuracy relative to other machine learning techniques, despite being constrained to be interpretable. The accuracy results were obtained by averaging test accuracies over five random training-test splits. Our final model was trained on the entire dataset.
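The simple per-subscale training above can be sketched as a bound-constrained logistic regression. The penalty weight and the choice of L-BFGS-B are our assumptions for illustration, not the paper's specification:

```python
import numpy as np
from scipy.optimize import minimize

def fit_subscale(B, y, l2=0.1):
    """Fit one subscale: logistic regression with non-negative
    coefficients (intercept unconstrained) and an L2 penalty (sketch).

    B : (n, d) 0/1 matrix of the subscale's binary features
    y : (n,) 0/1 default labels
    """
    n, d = B.shape

    def nll(w):
        z = B @ w[:d] + w[d]
        p = 1.0 / (1.0 + np.exp(-z))
        eps = 1e-9
        return (-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
                + l2 * np.sum(w[:d] ** 2))

    bounds = [(0.0, None)] * d + [(None, None)]  # coefficients >= 0
    res = minimize(nll, np.zeros(d + 1), method="L-BFGS-B", bounds=bounds)
    return res.x[:d], res.x[d]
```

The box constraints are what enforce monotonicity of the resulting step function.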
Our two-layer additive risk model comes naturally with a way to identify a list of factors that contribute most heavily to the final prediction. Table 1 shows a list of four factors that are important for predicting that observation “Demo 1” is at high risk of default (i.e., bad risk performance).
| Rank | Most important contributing factors |
|---|---|
| 1 | MaxDelq2PublicRecLast12M is 6 or less (from the most important subscale, Delinquency) |
| 2 | PercentTradesNeverDelq is 95 or less (from the most important subscale, Delinquency) |
| 3 | AverageMInFile is 48 or less (from the second most important subscale, TradeOpenTime) |
| 4 | AverageMInFile is 69 or less (from the second most important subscale, TradeOpenTime) |
To identify the factors, we first identify the most important two subscales and then the most important factors within each subscale. The importance of each subscale in the final model is determined by its weighted score, which is the product of the subscale’s output and its coefficient – the larger the product, the larger the contribution of the term in the final risk.
For example, for “Demo 1”, the two most important subscales are Delinquency (with points of 1.973) and TradeOpenTime (with points of 1.947). Then, within each of the two subscales, we find two factors that contribute the most to that particular subscale’s risk score. The two most important factors for each subscale are determined likewise by the product of the coefficient of each binary feature and the value of the binary feature itself. We finally output those binary features and their corresponding values as the most important contributing factors to the prediction made by our model.
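The two-stage ranking just described can be sketched as follows (Python; the subsets and weights are placeholders, not the fitted model):

```python
import numpy as np

def top_factors(b, subsets, sub_coefs, sub_biases, top_coefs,
                n_subscales=2, n_factors=2):
    """Return (subscale, feature) pairs: rank subscales by weighted
    output (coefficient * subscale probability), then rank binary
    features within each chosen subscale by coefficient * feature value.
    """
    b = np.asarray(b, dtype=float)
    contrib = []
    for idx, c, b0, w in zip(subsets, sub_coefs, sub_biases, top_coefs):
        p = 1.0 / (1.0 + np.exp(-(b[idx] @ c + b0)))
        contrib.append(w * p)  # the subscale's "points" toward the score
    top_subscales = np.argsort(contrib)[::-1][:n_subscales]
    factors = []
    for k in top_subscales:
        idx = np.asarray(subsets[k])
        within = np.asarray(sub_coefs[k]) * b[idx]
        for j in np.argsort(within)[::-1][:n_factors]:
            factors.append((int(k), int(idx[j])))
    return factors
```

The returned feature indices map back to statements like “MaxDelq2PublicRecLast12M is 6 or less” via the thresholds used during binarization.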
The factors are grouped by subscales and are displayed in decreasing order of importance (within each important subscale) to the global model’s predictions.
3 Consistent rule-based explanations with SetCoverExplanation
As noted earlier, in addition to predicting risk using a globally interpretable model, we generate consistent rules that summarize broad patterns of the classifier with respect to the data. They do not explain the global model’s computations; instead they provide useful patterns, which have been a popular form of explanation in the state-of-the-art literature on model explanations (Lakkaraju et al., 2017; Guidotti et al., 2018a; Ribeiro et al., 2018). As an example, consider Observation 6 in the FICO dataset, for which the global model predicts a high risk of default. SetCoverExplanation returns the following rule-based summary-explanation that includes Observation 6:
For all 700 people where: ExternalRiskEstimate , and NetFractionRevolvingBurden, the global model predicts a high risk of default.
SetCoverExplanation asserts that our global model predicts high-risk for all of the 700 previous cases that satisfy these rules. Therefore these rules are globally consistent. In contrast, explanations (from other methods) that are not consistent may hold for one customer but not for another, which could eventually jeopardize trust.
In what follows, we formalize the discussion on consistent rules and put it into concrete mathematical terms. After defining rules in Section 3.1, we address aspects of optimization in Section 3.2. In Section 3.3 we describe how to use consistent rules to identify similar cases for case-based explanations.
3.1 Notation and definitions
Consider a $d$-dimensional binary dataset $\{\mathbf{x}_i\}_{i=1}^n$, that is, $x_{ij} \in \{0,1\}$ for every $i \in \{1,\dots,n\}$ and $j \in \{1,\dots,d\}$. Let $f$ denote a classifier that was trained using the dataset, and let $\mathbf{y}^m = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))$ denote the vector of labels generated by the model (the superscript $m$ stands for model).
Assume $S \subseteq \{1, \dots, d\}$ is a subset of features and $q \in \{0, 1\}$ is a label. The rule $(S, q)$ describes the following binary function/classifier:

$$r_{S,q}(\mathbf{x}) = \begin{cases} q & \text{if } x_j = 1 \text{ for all } j \in S,\\ 1 - q & \text{otherwise.} \end{cases}$$

That is, for a given observation $\mathbf{x}$, the rule predicts based on the projection of the observation onto the subspace of features in $S$; specifically, by applying a logical AND operator to the features in $S$.
Let $(\mathbf{x}, y^m)$ denote an observation and the respective model prediction for which we wish to create a rule. We say that the rule $(S, q)$ provides a consistent summary-explanation for $(\mathbf{x}, y^m)$ if the following conditions are met:

(Relevance) $x_j = 1$ for every $j \in S$, and $q = y^m$.

(Consistency) For every observation $\mathbf{x}_i$ in the dataset for which $x_{ij} = 1$ for all $j \in S$, it must also hold that $y_i^m = q$.

The second condition establishes consistency by requiring all observations in the dataset to agree with the rule, in the sense that all observations for which the binary variables in $S$ are true are labeled by the global model as $q$.
We measure the quality of a rule using two criteria:
Sparsity – the cardinality of $S$, that is, $|S|$. This captures, to a certain degree, the level of interpretability of the rule.
Support – the number of observations in the dataset that satisfy the rule, namely $|\{i : x_{ij} = 1 \text{ for all } j \in S\}|$. This serves as a measure of coverage for the applicability of the rule.
Note that the fact that the above definitions apply only to rules where features are equal to 1 (and not 0) may seem to be a limitation. However, one can easily extend the feature space by adding binary features that are the complements of the original features, that is, by adding $\bar{x}_{ij} = 1 - x_{ij}$. In this case, rules that contain complement features can be interpreted as rules where an original feature is equal to 0. In what follows, we assume that the design matrix $X$ is of dimensions $n \times 2d$ and includes both the original and complement matrices: $X = [X^{\text{orig}}, \bar{X}^{\text{orig}}]$.
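The complement extension is a one-liner in practice (a sketch):

```python
import numpy as np

def with_complements(X):
    """Append complement columns so that a conjunction can also test
    whether an original feature equals 0 (by testing its complement)."""
    X = np.asarray(X)
    return np.hstack([X, 1 - X])
```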
3.2 Optimization
We now describe the formulation of multiple optimization problems to generate rules that achieve the objectives of high sparsity and high support.
Let $w_j \in \{0,1\}$ denote a binary decision variable that indicates whether $j \in S$. Denoting $S^* = \{j : x_j = 1\}$ as the largest set of features that “agree” with observation $\mathbf{x}$ allows us to write Condition 1 as $S \subseteq S^*$. Therefore, we need only consider variables $w_j$ for which $j \in S^*$. In order for Condition 2 to hold, observations with labels different from $q$ must not satisfy the rule $(S, q)$. That is, for each such observation $i$, some feature $j \in S$ must be selected for which $x_{ij} = 0$. Each feature $j$ therefore covers a set of observations with opposite labels, and a feasible solution must cover all observations whose labels are different from $q$. This is an instance of the Minimal Set Cover Problem (Feige, 1996, 1998).
More formally, let $O_j = \{i : y_i^m \neq q \text{ and } x_{ij} = 0\}$ denote the (constant) set of observations that feature $j$ covers. Finding a rule with optimal sparsity is the solution to the following optimization problem:

$$\min_{w} \sum_{j \in S^*} w_j \quad \text{subject to} \quad \sum_{j \in S^*:\, i \in O_j} w_j \geq 1 \;\; \text{for all } i \text{ with } y_i^m \neq q, \qquad w_j \in \{0, 1\}. \tag{1}$$
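The paper solves this set cover problem to optimality; as an illustrative stand-in, the classical greedy heuristic (an $\ln n$-approximation) can be sketched as:

```python
def sparse_rule(X, y_model, i):
    """Greedy stand-in for the minimal-set-cover formulation: build a
    conjunction of features that equal 1 for observation i, repeatedly
    adding the feature that rules out ("covers") the most remaining
    opposite-label observations, until the rule is globally consistent.
    Returns None if no consistent rule exists (the observation cannot be
    separated from opposite-label observations).
    """
    n, d = len(X), len(X[0])
    candidates = {j for j in range(d) if X[i][j] == 1}
    uncovered = {k for k in range(n) if y_model[k] != y_model[i]}
    rule = []
    while uncovered:
        if not candidates:
            return None
        best = max(candidates,
                   key=lambda j: sum(1 for k in uncovered if X[k][j] == 0))
        newly = {k for k in uncovered if X[k][best] == 0}
        if not newly:  # no remaining feature separates the rest
            return None
        rule.append(best)
        candidates.remove(best)
        uncovered -= newly
    return rule
```

Every observation that satisfies the returned conjunction is guaranteed to carry the same model label as observation `i`, which is exactly the consistency condition.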
We briefly note that we conducted a computational study on the FICO dataset where sparse explanations were generated for each of the 10K observations, based on all other observations. The running time consistently took less than 7 seconds, and the average sparsity was under 3 features.
Formulation (1) can be extended to incorporate support by adding a binary decision variable (and an appropriate linear constraint) for each observation, indicating whether the rule applies to that observation. An additional constraint limits the number of selected features in the resulting rule to a predefined constant MAX_SPARSITY.
We experimented with our formulations by generating summary-explanations on the FICO dataset. We first solved Formulation 1, and set its solution as the value of MAX_SPARSITY in the modified formulation that optimizes support. We then increased the value of MAX_SPARSITY by 1 and 2 to relax the respective constraint, in order to improve the support size (at the cost of worse sparsity).
We generated summary-explanations (where we maximized support) for all observations in the FICO dataset. The average number of features in each explanation did not change (compared with the optimal sparsity) and was equal to 2.9 when MAX_SPARSITY was set to the solution of Formulation 1; sparsity slightly worsened to 3.6 and 4.4 when MAX_SPARSITY was increased by 1 and 2, respectively. We also found that the fraction of summary-explanations with support less than 10 was 9.7% of all observations for the solution given by Formulation 1, and was 4.7%, 1.2%, and 0.2% of all observations for the maximal-support solution when MAX_SPARSITY was increased by 0, 1, and 2, respectively. Clearly, this indicates a tradeoff between support and sparsity: when the constraint on MAX_SPARSITY is relaxed, the cardinality of the rule increases and the rule becomes less interpretable; at the same time, support increases, which improves confidence in the resulting rule. Overall, the average cardinality is low and the support is reasonably high.
Optimization procedure for the challenge.
In an actual deployed system, generating rules could be done offline for each user, since users’ credit information changes infrequently over time. In contrast, in our code, we wanted to allow the user the ability to interact with the system to see how explanations are generated for any possible new observation. Therefore, we wanted to limit the running time for generating explanations to provide a reasonably short response.
To this end, we first created a database of summary-explanations using the FICO dataset. We created summary-explanations for maximal sparsity, and maximal support subject to sparsity constraints of +0, +1, and +2 of optimal. We did the same for 10K additional random observations. When a user clicks to generate an explanation, the database is scanned and the sparsest rule whose support is greater than 10 is returned. Otherwise, if no rules were found, the maximal sparsity optimization problem is solved. If the support of the returned rule is greater than 10, that solution is returned. Otherwise, the maximal support optimization is solved, consecutively relaxing the constraint on the maximal sparsity. If rules with sufficiently large support were not found, the procedure is repeated with minimal support set to 5 (instead of 10). Beyond this, an error message is displayed if no summary-explanations were found, because the observation is an outlier; there is no rule characterizing it.
Figure 5 illustrates the tradeoff between simplicity and coverage for two rules generated by SetCoverExplanation. The rule on the left is 2-dimensional and the rule on the right involves 3 dimensions.
3.3 Case-based explanations
Given a previously unseen case, we can identify similar cases from the FICO dataset to assess the prediction made by our global model for the unseen case. To do this, we find all the cases in the dataset that satisfy the consistent rule-based explanation for the unseen case (obtained by SetCoverExplanation). We then rank these cases according to how many binary features (see Section 2 for how we obtain the binary features) they share with the unseen case, and present the five highest-ranked similar cases to the user. Figure 6 gives an example of a case-based explanation: our visualization contains a table showing the current case (top row) and previous cases that are most closely related to the current case (all other rows).
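This ranking step can be sketched as follows (Python; the rule is assumed to be given as a list of binary-feature indices from SetCoverExplanation):

```python
def similar_cases(x_new, rule, X, top=5):
    """Find dataset cases that satisfy the new case's consistent rule,
    ranked by the number of binary features they share with the new case.

    x_new : 0/1 binary feature vector of the unseen case
    rule  : list of feature indices that must all equal 1
    X     : list of 0/1 rows (the dataset's binary features)
    """
    matches = [i for i in range(len(X))
               if all(X[i][j] == 1 for j in rule)]
    # stable sort: most shared binary features first
    matches.sort(key=lambda i: -sum(a == b for a, b in zip(X[i], x_new)))
    return matches[:top]
```

Because every returned case satisfies the same consistent rule, all of them carry the same global-model label as the unseen case.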
Future work: There are several possible extensions to our work. Methods like RiskSLIM (Ustun and Rudin, 2017) could make the subscale scores more interpretable by restricting them to integer coefficients. Our visualization interface could be extended to be more polished, in the style of Ming et al. (2018) for rule-list exploration.
Conclusion: Since the FICO dataset does not seem to require a black box for good performance, perhaps many other applications in finance also do not require one. Whether this holds remains unclear, but challenges like the one initiated by FICO can help us answer this important question.
Appendix: Login Information
The web interface for our system can be found at: http://dukedatasciencefico.cs.duke.edu
with username dukedatascience and password OxNaUTsSjH0GQ
- Chen and Rudin [2018] Chaofan Chen and Cynthia Rudin. An optimization approach to learning falling rule lists. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 604–612, 2018.
- Feige [1996] Uriel Feige. A threshold of ln n for approximating set cover (preliminary version). In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 314–318. ACM, 1996.
- Feige [1998] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.
- Guidotti et al. [2018a] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Dino Pedreschi, Franco Turini, and Fosca Giannotti. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820, 2018a.
- Guidotti et al. [2018b] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018b.
- Lakkaraju et al. [2017] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Interpretable & explorable approximations of black box models. arXiv preprint arXiv:1707.01154, 2017.
- Lou et al. [2013] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–631. ACM, 2013.
- Medical calculators [2018] MDCalc – medical calculators, equations, algorithms, and scores. https://www.mdcalc.com, 2018. Accessed: 2018-11-12.
- Ming et al. [2018] Y. Ming, H. Qu, and E. Bertini. RuleMatrix: Visualizing and understanding classifiers with rules. IEEE Transactions on Visualization and Computer Graphics, 2018.
- Moro et al. [2011] S. Moro, R. Laureano, and P. Cortez. Using data mining for bank direct marketing: An application of the CRISP-DM methodology. In Proceedings of the European Simulation and Modelling Conference (ESM’2011), pages 117–121. Eurosis, 2011.
- Moro et al. [2014] Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.
- Regulation [2016] Protection Regulation. General Data Protection Regulation. Official Journal of the European Union, 59:1–88, 2016.
- Ribeiro et al. [2018] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.
- Shaposhnik and Rudin [2018] Yaron Shaposhnik and Cynthia Rudin. Globally-consistent rule-based summary-explanations for machine learning models, with application to credit-risk evaluation. Unpublished, 2018.
- Ustun and Rudin [2016] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.
- Ustun and Rudin [2017] Berk Ustun and Cynthia Rudin. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.