Towards Explainable Deep Learning for Credit Lending: A Case Study

by   Ceena Modarres, et al.
Capital One
Columbia University

Deep learning adoption in the financial services industry has been limited due to a lack of model interpretability. However, several techniques have been proposed to explain predictions made by a neural network. We provide an initial investigation into these techniques for the assessment of credit risk with neural networks.




1 Deep Learning for Credit Risk Assessment

Deep learning has transformed many areas of data science, but has experienced limited adoption in the financial industry, particularly for assessing creditworthiness. This is not due to performance, which has been shown to eclipse decision trees and logistic regression on credit-risk-related tasks, but is rather a product of neural networks' black-box quality [angelini2008neural, fei2015, stroie2013]. Both the Equal Credit Opportunity Act (ECOA), as implemented in Regulation B, and the Fair Credit Reporting Act (FCRA) require lenders to provide reasons for denying a credit application [adversea67]. Neural networks' black-box quality can hinder a lender's ability to assess model fairness, detect bias, and meet regulatory demands. Unlike linear models, where the learned feature weights indicate relative importance, there is no clear mechanism for assessing the reasons behind a neural network's prediction [doshi2017towards]. Several attribution techniques have been proposed to explain neural network predictions, but they have not significantly filtered into the financial industry [NIPS2017_7062, frosst2017distilling]. An attribution technique explains a given prediction by ranking the features most important to generating that prediction. Three prominent techniques are LIME [ribeiro2016should], DeepLIFT [shrikumar2017learning], and Integrated Gradients [sundararajan2017axiomatic]. These methods are theoretically founded and provide human-interpretable reasons for a prediction. However, they have not yet been studied or applied in the context of credit assessment. In order to evaluate the viability of attribution methods in credit lending, we ground our analysis in two questions:

  1. Do these attribution methods generate accurate and interpretable explanations for a credit decision? (Trustworthiness)

  2. How consistently do attribution methods produce trustworthy explanations for a credit decision? (Reliability)

We explore the trustworthiness and reliability of LIME, DeepLIFT, and Integrated Gradients, and propose two future research directions for the practical use of attribution methods in credit lending [kindermans2017reliability, adebayo2018local, adebayo2018sanity].

2 Background: Explaining Credit Decisions

We consider FICO's open-source Explainable Machine Learning dataset of Home Equity Line of Credit applications [explaina31]. The dataset provides information on credit candidates' applications and a binary risk outcome characterized by 90-day payment delinquencies. To test the reliability and trustworthiness of attributions, we train a feed-forward neural network to forecast credit risk (dataset and model details in Section 6).

3 Experiment 1: Trustworthiness of Attributions for Credit Lending

Generating an Accepted Ground Truth

We train a logistic regression classifier to predict credit risk, and measure feature importance through the learned feature weights. We treat this learned feature importance as the accepted ground-truth explanation due to its wide application in credit risk modeling by practitioners, its acceptance among regulatory bodies, and its theoretical grounding [Creditsc67]. We train the logistic regression on the FICO dataset and use the learned global weights as the benchmark against which to compare explanations of neural network predictions.

Neural Network Explanations

We generate explanations for the neural network model using LIME, DeepLIFT, and Integrated Gradients. LIME produces local explanations by perturbing the input around a neighborhood and fitting a linear model. Integrated Gradients and DeepLIFT are gradient-based methods that use a baseline vector to identify the feature dimensions with the highest activations [ancona2018towards]. We apply all three attribution techniques to the trained neural network using a neutral baseline (see Section 6 for details). We compare the attributions against the accepted global feature importance in two ways: 1) concordance of the top features, and 2) similarity to the accepted global weights (see Figure 1) [Lee12].

Figure 1: Top 7 features for each technique evaluated by absolute weight (Logistic) or proportion of samples ranked by importance (Attribution). L2 and Rank Dist measure the average L2 norm and Spearman’s rho weighted rank distance between local explanations and global weights.

Experimental results are displayed in Figure 1. Both DeepLIFT and Integrated Gradients produce similar top features. Of logistic regression's top 7 features, DeepLIFT agrees on 6 and Integrated Gradients agrees on all 7, though the ranking varies across all three methods. On average, Integrated Gradients produces explanations closer to the global weights in terms of both rank and L2 distance (see Figure 1). Although LIME has a smaller distance to the accepted global explanation, it identifies only two of the top 7 features. This skewed distribution accounts for the smaller norms, but indicates that LIME fails to capture the full set of predictive features.
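The two comparison criteria, top-feature concordance and distance to the global weights, can be sketched as follows (a minimal illustration in stdlib Python; the function names are ours, not from the paper's code):

```python
import math

def top_k_overlap(attr, weights, k=7):
    """Count how many of the top-k features (by absolute value)
    agree between a local attribution and the global weights."""
    top_a = set(sorted(range(len(attr)), key=lambda i: -abs(attr[i]))[:k])
    top_w = set(sorted(range(len(weights)), key=lambda i: -abs(weights[i]))[:k])
    return len(top_a & top_w)

def l2_distance(attr, weights):
    """L2 norm between a local attribution vector and the global weights."""
    return math.sqrt(sum((a - w) ** 2 for a, w in zip(attr, weights)))
```

In the paper's setting, `attr` would be one candidate's attribution vector and `weights` the logistic regression coefficients, with the distances averaged over candidates.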

3.1 Future Research Direction - Attribution Discrepancy

Despite notable agreement, there are discrepancies between the generated and accepted explanations for all three techniques. In order for practitioners to trust explanations of neural network credit risk predictions, we propose a future research direction: is the discrepancy in model explanations due to differences in the learned interactions, or due to error in the attributions? One approach would dissect the decision boundary learned by the neural network and compare it against the feature interactions learned by a surrogate model. In addition, the research could explore a measure of an attribution's fidelity to an accepted explanation. This would require a systematic approach for comparing local attributions against an accepted global explanation.

4 Experiment 2: Reliability of Attribution Methods

The Integrated Gradients and DeepLIFT algorithms require the specification of a reference point, or baseline vector. The explanation of a credit decision is created relative to the selected reference point. Some domains have a clear-cut reference: in image processing, it is standard to use an all-black image. In credit lending, there is no context-specific baseline. In this section, we explore the impact of the reference point on explanations in credit risk modeling (reliability), evaluate possible credit-lending-specific reference points, and propose future research directions.

4.1 Experiment - Reliability of Attribution Methods to Baseline Selection

We explore the sensitivity of both Integrated Gradients and DeepLIFT to the choice of baseline. This reference point should serve as a basis for explaining a credit risk prediction, and we use it as a foundation to evaluate the reliability of each explanation method relative to the choice of reference point. To evaluate reliability, we generate three sets of references. In the first set, we randomly generate baselines between the minimum and maximum value of each feature in the FICO dataset. In the second set, we interpret the recommendation of a neutral input in Sundararajan et al., 2017 [sundararajan2017axiomatic] to mean candidates that lie on the decision boundary. For the third set, we tightly constrain the generated reference points to the 10 closest candidates on the decision boundary.
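The first reference set described above, feature-wise uniform random baselines, can be sketched as follows (a minimal illustration assuming the dataset is a list of numeric rows; the names are ours):

```python
import random

def random_baselines(X, n_refs, seed=0):
    """Draw n_refs baseline vectors, each feature sampled uniformly
    between that feature's observed min and max in the dataset X."""
    rng = random.Random(seed)
    mins = [min(col) for col in zip(*X)]
    maxs = [max(col) for col in zip(*X)]
    return [[rng.uniform(lo, hi) for lo, hi in zip(mins, maxs)]
            for _ in range(n_refs)]
```

The constrained sets would instead filter actual candidates by model output near 0.5 and, for the tightest set, by distance to the candidate being explained.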

Baseline              Random              Constrained         Tightly Constrained
Attribution Method    Entropy   Std Dev   Entropy   Std Dev   Entropy   Std Dev
IG                    0.121     0.002     0.115     0.001     0.192     0.001
DeepLIFT              0.119     0.002     0.103     0.001     0.230     0.000

Table 1: Attribution baseline sensitivity results. Attribution uncertainty metrics evaluated across three runs: one with fully random baselines, one with heuristically constrained baselines, and one with tightly constrained baselines. The results imply that a tight constraint reduces uncertainty.

Figure 2: Histogram of entropy values for credit candidates with varying reference points. Tightly constrained reference points reduce the uncertainty of attributions across reference values. This result suggests that attribution methods are sensitive to the reference point, and that there may be desirable properties associated with varying the reference for each credit candidate.

For a single candidate, we generate multiple attributions, varying the reference for each. We then evaluate the uncertainty of the attributions for each candidate with two metrics: the standard deviation and the Shannon entropy, defined as

H = -\sum_i p_i \log p_i

where p_i is the i-th normalized attribution value. We treat a feature's attributions across the varied references as a probability density, so more uniform attributions correspond to higher entropies (see Appendix). Experimental findings are displayed in Table 1 and Figure 2. We found that varying the reference point can have as large an impact on a candidate's explanation as the applicant's actual profile. However, when references are constrained to input samples that lie on the decision boundary, there is a significant decrease in the standard deviation of an explanation for a single candidate across both attribution methods. When we constrain references to the ten most similar candidates on the decision boundary, there is an even larger reduction in dispersion and entropy increases. This result suggests that the most appropriate reference point may vary across candidates.
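Under the normalization described in the Appendix, the two uncertainty metrics can be sketched as follows (illustrative helper names; normalizing by absolute values is one reasonable choice, not stated explicitly in the paper):

```python
import math
import statistics

def attribution_entropy(attrs):
    """Normalize one feature's attributions across references into a
    probability density, then compute Shannon entropy. Near-identical
    attributions give a near-uniform density, hence high entropy."""
    total = sum(abs(a) for a in attrs)
    probs = [abs(a) / total for a in attrs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def attribution_std(attrs):
    """Population standard deviation of attributions across references."""
    return statistics.pstdev(attrs)
```

Identical attributions across references thus yield zero standard deviation and maximal entropy, matching the paper's "higher entropy means lower uncertainty" convention.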

4.2 Intuitive Baselines in Credit Lending are not on the Decision Boundary

To shed light on reference selection for attribution methods in credit lending, we consider three potential reference candidates: a candidate that lies on the decision boundary, a candidate with no credit history, and the average candidate in our dataset. The results are shown in Table 2.

                              Unclassifiable   Avg Candidate   New Candidate
Softmax output                [0.41, 0.59]
Delinquencies                 1                1               0
Credit History (years)        16               16              0
Credit Trades*                21               20              0
Accts with Balance            78%              67%             NA
Credit Inquiries (Last 6 mo)  2                1               0

*Credit trades refers to any agreement between a lending agency and a consumer.

Table 2: Credit profiles of potential reference points.

We see a notable discrepancy between qualitatively intuitive baselines and the softmax output. These candidates would ideally serve as a neutral point to frame insightful and actionable explanations for a credit decision. However, our model evaluates these candidates as high risk. A candidate that lies on the decision boundary, on the other hand, has no significance in credit lending. Ultimately, this does not mean intuitive options cannot or should not be used. In the following section we propose a research direction for understanding reference selection in credit lending.

4.3 Future Research Direction - Reference Point User Study

We have shown that attribution methods are sensitive to the choice of reference and that there is no obvious reference in credit lending. On the one hand, intuitive references provide a foundation upon which we can produce explanations (e.g., the candidate has more credit trades than the average customer). On the other hand, these intuitive references may not be capable of providing actionable suggestions. Thus, it is important to understand which reference is most understandable, trustworthy, and reliable for the user. We propose a research study that explores credit candidates' reactions to, and comprehension of, many different types of explanations [DBLP2018]. Specifically, we want to compare suggestion-based explanations with explanations relative to intuitive references. We found empirical evidence that personalizing references increases the reliability of explanations. Therefore, we additionally want to consider personalization by providing candidate-specific references that describe an individual's best path to creditworthiness.

5 Conclusion

For the successful adoption of advanced machine learning algorithms in credit markets, model interpretability and explainability are necessary to meet regulatory demands and ensure fair lending practices. In this paper, we explored the process of explaining credit lending decisions made by a neural network using three different attribution methods: LIME, DeepLIFT, and Integrated Gradients. Using the widely accepted logistic regression as a surrogate for ground-truth interpretations, we found some evidence that attribution methods are trustworthy. We also found that attribution methods are sensitive to the selected reference point. We discussed future research directions to help make deep learning perceived as more trustworthy and reliable by those working in credit lending. Disclaimer: the opinions in this paper are those of the authors and not of Capital One.


6 Appendix

6.1 Mapping to Original FICO Feature Names

Raw Feature Name Description
RiskPerformance Paid as negotiated flag (12-36 Months)
ExternalRiskEstimate Consolidated version of risk markers
MSinceOldestTradeOpen Months Since Oldest Trade Open
MSinceMostRecentTradeOpen Months Since Most Recent Trade Open
AverageMInFile Average Months in File
NumSatisfactoryTrades Number Satisfactory Trades
NumTrades60Ever2DerogPubRec Number Trades 60+ Ever
NumTrades90Ever2DerogPubRec Number Trades 90+ Ever
PercentTradesNeverDelq Percent Trades Never Delinquent
MSinceMostRecentDelq Months Since Most Recent Delinquency
MaxDelq2PublicRecLast12M Max Delq/Public Records Last 12 Months
MaxDelqEver Max Delinquency Ever. See tab "MaxDelq" for each category
NumTotalTrades Number of Total Trades (total number of credit accounts)
NumTradesOpeninLast12M Number of Trades Open in Last 12 Months
PercentInstallTrades Percent Installment Trades
MSinceMostRecentInqexcl7days Months Since Most Recent Inq excl 7days
NumInqLast6M Number of Inq Last 6 Months
NumInqLast6Mexcl7days Number of Inq Last 6 Months excl 7days
NetFractionRevolvingBurden Net Fraction Revolving Burden
NetFractionInstallBurden Net Fraction Installment Burden
NumRevolvingTradesWBalance Number Revolving Trades with Balance
NumInstallTradesWBalance Number Installment Trades with Balance
NumBank2NatlTradesWHighUtilization Number Bank/Natl Trades w high utilization ratio
PercentTradesWBalance Percent Trades with Balance

6.2 Mutual Information

We also measure the joint mutual information of each feature with the target variable. Mutual information quantifies the information obtained about one variable through another. It is commonly used for feature selection, to determine which set of features is important for a particular task. We find that the top features by the global weights learned by logistic regression correspond to the features with the highest mutual information. In the context of a binary target Y and a feature X, we measure mutual information as

I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}

where p(x) and p(y) are the marginal densities. If X and Y are unrelated, then by independence p(x, y) = p(x) p(y), implying zero mutual information. The stronger the association between X and Y, the larger the mutual information. Mutual information can also be interpreted as the entropy of X minus the conditional entropy of X given Y, H(X) - H(X | Y). Our goal is to identify the features with maximal mutual information with the target, which should correspond to the most important variables. We do so by selecting, for each feature X_i, the features maximizing I(X_i; Y). We estimate mutual information following the approach described in Ross et al., 2014 [15]. We find that the features with the highest mutual information match the top feature importances ascertained via the logistic regression weights.
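For discrete data, a simple plug-in estimate of the mutual information above can be computed from joint counts (a minimal sketch; this is the naive count-based estimator, not the Ross et al. estimator used in the paper):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples:
    sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * math.log(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi
```

A perfectly predictive binary feature attains I = log 2 nats with a balanced binary target, while an independent feature attains I = 0.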

6.3 Logistic Regression Classifier Training

We train a logistic regression binary classifier to predict credit risk on the FICO dataset. We ran a grid search on the regularization parameter using both l1 and l2 norms. We withhold 33% of samples for validation and achieve a validation accuracy of 73% on the hold-out set. The dataset has balanced classes.

6.4 Neural Network Model Architecture

We train a neural network binary classifier to predict credit risk on the FICO dataset. The model uses

  • 2 hidden layers with 17 and 5 ReLU activation units

  • sigmoid activation for output probability

  • trained for 20 epochs with batch size of 100 samples

The model achieves 73% validation accuracy on the balanced FICO dataset (1/3 of samples held out for validation).

6.5 Attribution Techniques Background

LIME produces interpretable representations of complicated models by optimizing two objectives: interpretability and local fidelity. The explanation is defined as

\xi(x) = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g)

where the complexity \Omega(g) of an interpretable function g (e.g., the number of non-zero coefficients of a linear model) is optimized alongside the fidelity, or faithfulness, L(f, g, \pi_x) of g in approximating the true function f in the local neighborhood defined by \pi_x. In practice, an input point is perturbed by random sampling in a local neighborhood and a simpler linear model is fit to the newly constructed synthetic data set. The method is model-agnostic, meaning it can be applied to neural networks or any other uninterpretable model. The now-explainable linear model's weights can then be used to interpret a particular model prediction. Here we use the implementation provided by the open-source LIME Python package.

Integrated Gradients is founded on axioms for attributions. First, sensitivity states that for any two inputs that differ in only a single feature and have different predictions, that feature's attribution should be non-zero. Second, implementation invariance states that two networks producing equal outputs for all possible inputs should also have the same attributions. The authors develop an approach that takes the path integral of the gradients of the model F along the straight-line path between a zero-information baseline x' and the input x:

IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha
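This path integral can be approximated numerically; below is a minimal sketch using a midpoint Riemann sum, assuming the model's gradient is available as a function (the names are ours). For a linear model the constant gradient makes the approximation exact:

```python
def integrated_gradients(x, baseline, grad_fn, steps=100):
    """Approximate IG_i = (x_i - x'_i) * integral_0^1 of
    dF(x' + a(x - x'))/dx_i da via a midpoint Riemann sum
    over `steps` points on the straight-line path."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th subinterval
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad_fn(point)  # gradient of the model at this path point
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(x[i] - baseline[i]) * avg_grad[i] for i in range(n)]
```

For the toy linear model F(x) = 2 x_0 + 3 x_1 with a zero baseline, the attributions at x = (1, 1) are exactly (2, 3), and they sum to F(x) - F(baseline) as expected.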

The path integral between the baseline and the true input can be approximated with Riemann sums, producing an attribution vector for a particular (local) observation. The authors found that 20 to 300 steps suffice to approximate the integral to within 5%.

DeepLIFT explains the difference in output between a 'reference' (or 'baseline') input x' and the original input x for a neuron output t. For each input x_i, an attribution score C_{\Delta x_i \Delta t} is calculated; these scores should sum to the total change in the output, \Delta t. DeepLIFT is best defined by this summation-to-delta property:

\sum_i C_{\Delta x_i \Delta t} = \Delta t

The authors use a function analogous to a partial derivative and apply the chain rule to trace attribution scores from the output layer back to the original input. This method also requires a reference vector.
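For a linear model the summation-to-delta property can be illustrated directly: the attributions reduce to (x_i - x'_i) w_i and sum exactly to the output change (a minimal sketch, not the full DeepLIFT algorithm; the names are ours):

```python
def linear_deeplift(x, baseline, w):
    """For a linear model F(x) = w.x + b, DeepLIFT attributions reduce
    to (x_i - x'_i) * w_i, which satisfy summation-to-delta exactly."""
    return [(xi - ri) * wi for xi, ri, wi in zip(x, baseline, w)]

def linear_model(x, w, b=0.0):
    """The toy linear model F(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

The bias b cancels in the output difference, so the attributions account for the full delta regardless of its value.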

6.6 On Entropy Score in Experiment 2

In Experiment 2, we state that lower uncertainty corresponds to higher entropy. For a probability density function, higher uncertainty is typically associated with higher entropy. However, we normalize the attributions and use them directly as a probability density rather than fitting a distribution to the data. Because of this choice, a set of very similar attributions across a varied reference has high entropy. Consider the toy example of a feature attribution across three different references that produced the uniform values (0.2, 0.2, 0.2). We would normalize this set of feature attributions to (1/3, 1/3, 1/3) and treat it as the probability density. Since this density represents a uniform distribution, it produces maximal entropy.

6.7 Feature Importance and Attributions

Figure 3: Feature Importance for each technique. Learned absolute feature weights are used for logistic regression and frequency ranking across all samples is used for LIME, DeepLIFT, and Integrated Gradients.