1 Deep Learning for Credit Risk Assessment
Deep learning has transformed many areas of data science but has seen limited adoption in the financial industry, particularly for assessing creditworthiness. This is not due to the performance of neural networks, which has been shown to eclipse that of decision trees and logistic regression in credit-risk tasks, but rather to their black-box quality [5, 9, 18]. Both the Equal Credit Opportunity Act (ECOA), as implemented in Regulation B, and the Fair Credit Reporting Act (FCRA) require lenders to provide reasons for denying a credit application [3]. The black-box quality of neural networks can hinder a lender’s ability to assess model fairness, detect bias, and meet regulatory demands. Unlike linear models, where the learned feature weights indicate relative importance, there is no clear mechanism for assessing the reasons behind a neural network’s prediction [6]. Several attribution techniques have been proposed to explain neural network predictions, but they have not significantly filtered into the financial industry [12, 8]. An attribution technique explains a given prediction by ranking the most important features used for generating that prediction. Three prominent techniques are LIME [14], DeepLIFT [16], and Integrated Gradients [19]. These methods are theoretically founded and provide human-interpretable reasons for a prediction. However, they have not yet been studied or applied in the context of credit assessment. To evaluate the viability of attribution methods in credit lending, we ground our analysis in two important questions:
Do these attribution methods generate accurate and interpretable explanations for a credit decision? (Trustworthiness)

How consistently do attribution methods produce trustworthy explanations for a credit decision? (Reliability)
We explore the trustworthiness and reliability of LIME, DeepLIFT, and Integrated Gradients and propose two future research directions for the practical use of attribution methods in credit lending [1, 2, 10].
2 Background: Explaining Credit Decisions
We consider FICO’s open-source Explainable Machine Learning dataset of Home Equity Line of Credit applications [7]. The dataset provides information on credit applications and a risk outcome characterized by 90-day payment delinquencies. To test the reliability and trustworthiness of attributions, we train a feed-forward neural network to forecast credit risk (dataset and model details are in the Appendix).
3 Experiment 1: Trustworthiness of Attributions for Credit Lending
Generating an Accepted Ground Truth
We train a logistic regression classifier on the FICO dataset to predict credit risk and measure feature importance through the learned feature weights. We treat this learned feature importance as the accepted ground-truth explanation because of its wide application in credit risk modeling by practitioners, its acceptance among regulatory bodies, and its theoretical grounding [17]. The learned global weights serve as the benchmark against which we compare explanations of neural network predictions.
Neural Network Explanations
We generate explanations for the neural network model using LIME, DeepLIFT, and Integrated Gradients. LIME produces local explanations by perturbing the input within a neighborhood and fitting a linear model. Integrated Gradients and DeepLIFT are gradient-based methods that use a baseline vector to identify the feature dimensions with the highest activations [4]. We apply all three attribution techniques to the trained neural network using a neutral baseline (see Section 6 for details). We compare the attributions against the accepted global feature importance in two ways: (1) concordance of the top features and (2) similarity to the accepted global weights (see Figure 1) [11].
Experimental results are displayed in Figure 1. Both DeepLIFT and Integrated Gradients produce similar top features. Of logistic regression’s top 7 features, DeepLIFT agrees on 6 and Integrated Gradients agrees on all 7. However, the ranking varies across all three methods. On average, Integrated Gradients produces explanations closer to the global weights in terms of rank and L2 distance (see Figure 1). Although LIME has a smaller distance to the accepted global explanation, it identifies only two of the top 7 features. This skewed distribution accounts for the smaller norms, but indicates that LIME fails to capture the full set of predictive features.
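The two comparisons described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the ranking by absolute value, and the scaling of both vectors to unit L2 norm before measuring distance are assumptions, and the numbers are made up.

```python
import math

def top_k_overlap(attribution, global_weights, k=7):
    """Count features shared by the top-k of both vectors, ranked by absolute value."""
    def top_k(values):
        order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
        return set(order[:k])
    return len(top_k(attribution) & top_k(global_weights))

def l2_distance(attribution, global_weights):
    """Euclidean distance after scaling each vector to unit L2 norm (an assumption)."""
    def unit(values):
        norm = math.sqrt(sum(v * v for v in values))
        return [v / norm for v in values] if norm > 0 else list(values)
    a, w = unit(attribution), unit(global_weights)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, w)))

# Hypothetical numbers for illustration only (not results from the paper).
global_weights = [0.9, -0.7, 0.5, 0.1, 0.05]
attribution = [0.8, 0.2, -0.6, 0.4, 0.0]
overlap = top_k_overlap(attribution, global_weights, k=3)
distance = l2_distance(attribution, global_weights)
```

Ranking by absolute value treats strongly negative and strongly positive weights as equally important; a signed comparison would need a different ranking rule.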
3.1 Future Research Direction: Attribution Discrepancy
Despite notable agreement, there is discrepancy across all three techniques between generated and accepted explanations. In order for practitioners to trust explanations of neural network credit risk predictions, we propose a future research direction: Is the discrepancy in model explanations due to differences in the learned interactions or due to error in the attributions? One approach would dissect the decision boundary learned by neural networks to compare against the feature interactions learned by a surrogate model. In addition, the research could explore a measure of an attribution’s fidelity to an accepted explanation. This would require a systematic approach by which to compare local attributions against an accepted global explanation.
4 Experiment 2: Reliability of Attribution Methods
The Integrated Gradients and DeepLIFT algorithms require the specification of a reference point, or baseline vector. The explanation of a credit decision is created relative to the selected reference point. Some domains have a clear-cut reference: in image processing, it is standard to use an all-black image. In credit lending, there is no context-specific baseline. In this section, we explore the impact of the reference point on explanations in credit risk modeling (reliability), evaluate possible credit-lending-specific reference points, and propose future research directions.
4.1 Experiment: Reliability of Attribution Methods to Baseline Selection
We explore the reliability of both Integrated Gradients and DeepLIFT with respect to the choice of baseline. This reference point should serve as a basis to explain a credit risk prediction, so we evaluate how much each explanation method changes as the reference point changes. To evaluate reliability, we generate three sets of references. In the first set, we randomly generate baselines between the minimum and maximum value of each feature in our FICO dataset. In the second set, we interpret the recommendation of a neutral input in Sundararajan et al. [19] to mean candidates that lie on the decision boundary. For the third set, we tightly constrain the generated reference points to the 10 closest candidates on the decision boundary.
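One way to build the three reference sets is sketched below. It is an illustration under stated assumptions rather than the procedure used in the experiment: the tolerance `tol` that defines "on the decision boundary", the Euclidean distance used to find the closest candidates, and all function names are hypothetical.

```python
import math
import random

def random_baselines(data, n, seed=0):
    """Sample n baselines uniformly between each feature's observed min and max."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    return [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in range(n)]

def boundary_candidates(data, predict, tol=0.05):
    """Keep candidates whose predicted probability lies within tol of 0.5 (assumed rule)."""
    return [row for row in data if abs(predict(row) - 0.5) <= tol]

def nearest_boundary_candidates(candidate, boundary, k=10):
    """Return the k boundary candidates closest to `candidate` in Euclidean distance."""
    return sorted(boundary, key=lambda row: math.dist(candidate, row))[:k]

# Toy data and a hypothetical scoring function, for illustration only.
data = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, 1.0], [3.0, -3.0]]
predict = lambda row: 1.0 / (1.0 + math.exp(-(row[0] + row[1] - 1.0)))
set_random = random_baselines(data, n=4)
set_boundary = boundary_candidates(data, predict)
set_tight = nearest_boundary_candidates(data[2], set_boundary, k=2)
```

The third set is personalized: it depends on the candidate being explained, whereas the first two sets are shared by all candidates.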
Attribution Method  Random Entropy  Random Std Deviation  Constrained Entropy  Constrained Std Deviation  Tightly Constrained Entropy  Tightly Constrained Std Deviation

IG  0.121  0.002  0.115  0.001  0.192  0.001
DeepLIFT  0.119  0.002  0.103  0.001  0.230  0.000
Table 1: Attribution baseline sensitivity results. Attribution uncertainty metrics are evaluated across three runs: one with fully random baselines, one with heuristically constrained baselines, and one with tightly constrained baselines. Results imply that a tight constraint reduces uncertainty.
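For a single candidate, we generate a set of attributions, varying the reference for each attribution. We then evaluate the uncertainty of the attributions for each candidate with two metrics: standard deviation and Shannon entropy, defined as
$H(p) = -\sum_{i} p_i \log p_i$  (1)
where $p_i$ is the normalized attribution of a feature under the $i$-th reference.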
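The two uncertainty metrics can be computed for one feature of one candidate as in the sketch below. The normalization by absolute value, the natural logarithm, and the use of the population standard deviation are assumptions made for illustration, and the attribution values are invented.

```python
import math
import statistics

def attribution_entropy(values):
    """Shannon entropy of one feature's attributions across references,
    after normalizing their absolute values to sum to one (an assumption)."""
    total = sum(abs(v) for v in values)
    if total == 0:
        return 0.0
    probs = [abs(v) / total for v in values]
    return -sum(p * math.log(p) for p in probs if p > 0)

def attribution_std(values):
    """Population standard deviation of one feature's attributions across references."""
    return statistics.pstdev(values)

# Hypothetical attributions of one feature under three different references.
stable = [0.2, 0.2, 0.2]
unstable = [0.9, 0.05, 0.05]
stable_entropy = attribution_entropy(stable)
unstable_entropy = attribution_entropy(unstable)
```

Under this normalization, attributions that barely change with the reference give a near-uniform density and therefore a high entropy, while their standard deviation is close to zero.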
We treat a feature’s attribution across a varied reference as a probability density, so more uniform attributions correspond to higher entropies (see Appendix). Experimental findings are displayed in Table 1 and Figure 2. We found that varying the reference point can have as large an impact on a candidate’s explanation as the applicant’s actual profile. However, when references are constrained to input samples that lie on the decision boundary, there is a significant decrease in the standard deviation of an explanation for a single candidate across both attribution methods. When we constrain references to only the ten most similar candidates on the decision boundary, dispersion falls even further and entropy increases. This result suggests that the most appropriate reference point may vary across candidates.
4.2 Intuitive Baselines in Credit Lending are not on the Decision Boundary
To shed light on reference selection for attribution methods in credit lending, we consider three potential candidate references: a candidate that is on the decision boundary, a candidate with no credit history, and the average candidate in our dataset. The results are shown in Table 2.
Feature  Unclassifiable  Avg Candidate  New Candidate
[0.41, 0.59]  
Delinquencies  1  1  0 
Credit History (years)  16  16  0 
Credit Trades*  21  20  0 
Accts with Balance  78%  67%  NA 
Credit Inquiries (Last 6mo)  2  1  0 
*Credit trades refers to any agreement between a lending agency and consumers 
We see a notable discrepancy between qualitatively intuitive baselines and the softmax output. These candidates would ideally serve as a neutral point to frame insightful and actionable explanations for a credit decision. However, our model evaluates these candidates as high risk. A candidate that lies on the decision boundary, on the other hand, has no significance in credit lending. Ultimately, this does not mean intuitive options cannot or should not be used. In the following section we propose a research direction for understanding reference selection in credit lending.
4.3 Future Research Direction: Reference Point User Study
We have shown that attribution methods are sensitive to the choice of reference and that there is no obvious reference in credit lending. On the one hand, intuitive references provide a foundation upon which we can produce explanations (e.g. the candidate has more credit trades than the average customer). On the other hand, these intuitive references may not be capable of providing actionable suggestions. Thus, it is important to understand which reference is most understandable, trustworthy, and reliable for the user. We propose a research study that explores the reactions and comprehension of credit candidates across many different types of explanations [13]. Specifically, we want to compare suggestion-based explanations with explanations relative to intuitive references. We found empirical evidence that personalizing references increases the reliability of explanations. Therefore, we additionally want to consider personalization by providing candidate-specific references that describe an individual’s best path to creditworthiness.
5 Conclusion
For the successful adoption of advanced machine learning algorithms in credit markets, model interpretability and explainability are necessary to meet regulatory demands and ensure fair lending practices. In this paper, we explored the process of explaining credit lending decisions made by a neural network using three different attribution methods: LIME, DeepLIFT, and Integrated Gradients. Using the widely accepted logistic regression as a surrogate for ground-truth interpretations, we found some evidence that attribution methods are reliable. We also found that attribution methods are sensitive to the selected reference point. We discussed future research directions to help deep learning be perceived as more trustworthy and reliable by those working in credit lending. Disclaimer: the opinions in the paper are the opinions of the authors and not Capital One.
References
 [1] Julius Adebayo, Justin Gilmer, Ian Goodfellow, and Been Kim. Local explanation methods for deep neural networks lack sensitivity to parameter values. ICLR 2018 Workshop, 2018.
 [2] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. arXiv preprint, 2018.
 [3] Sarah Ammermann. Adverse action notice requirements under the ECOA and the FCRA. Consumer Compliance Outlook. https://consumercomplianceoutlook.org/2013/secondquarter/adverseactionnoticerequirementsunderecoafcra/, 2013. Federal Reserve Bank of Minneapolis.
 [4] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, 2018.
 [5] Eliana Angelini, Giacomo di Tollo, and Andrea Roli. A neural network approach for credit risk evaluation. The quarterly review of economics and finance, 48(4):733–755, 2008.
 [6] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
 [7] FICO. Explainable machine learning challenge. https://community.fico.com/s/explainablemachinelearningchallenge, 2018.
 [8] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.

 [9] Samsul Islam, Lin Zhou, and Fei Li. Application of artificial intelligence (artificial neural network) to assess credit risk: A predictive model for credit card scoring, 04 2015.
 [10] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867, 2017.
 [11] Paul H Lee and Philip LH Yu. Mixtures of weighted distance-based models for ranking data with applications in political studies. Computational Statistics & Data Analysis, 56(8):2486–2500, 2012.
 [12] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
 [13] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. CoRR, abs/1802.00682, 2018.
 [14] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
 [15] Brian C. Ross. Mutual information between discrete and continuous data sets. PLOS ONE, 9(2):1–5, 02 2014.
 [16] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
 [17] Nikos Skantzos. Credit scoring: Case study in data analytics. https://www2.deloitte.com/content/dam/Deloitte/global/Documents/FinancialServices/gxbeaersfsicreditscoring.pdf, 2016. Deloitte Financial Services.
 [18] Laura Badea Stroie. Techniques for customer behaviour prediction: A case study for credit risk assessment. https://ec.europa.eu/eurostat/cros/system/files/NTTS2013fullPaper_171.pdf.
 [19] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
6 Appendix
6.1 Mapping to Original FICO Feature Names
Raw Feature Name  Description 

RiskPerformance  Paid as negotiated flag (12-36 Months) 
ExternalRiskEstimate  Consolidated version of risk markers 
MSinceOldestTradeOpen  Months Since Oldest Trade Open 
MSinceMostRecentTradeOpen  Months Since Most Recent Trade Open 
AverageMInFile  Average Months in File 
NumSatisfactoryTrades  Number Satisfactory Trades 
NumTrades60Ever2DerogPubRec  Number Trades 60+ Ever 
NumTrades90Ever2DerogPubRec  Number Trades 90+ Ever 
PercentTradesNeverDelq  Percent Trades Never Delinquent 
MSinceMostRecentDelq  Months Since Most Recent Delinquency 
MaxDelq2PublicRecLast12M  Max Delq/Public Records Last 12 Months 
MaxDelqEver  Max Delinquency Ever. See tab "MaxDelq" for each category 
NumTotalTrades  Number of Total Trades (total number of credit accounts) 
NumTradesOpeninLast12M  Number of Trades Open in Last 12 Months 
PercentInstallTrades  Percent Installment Trades 
MSinceMostRecentInqexcl7days  Months Since Most Recent Inq excl 7days 
NumInqLast6M  Number of Inq Last 6 Months 
NumInqLast6Mexcl7days  Number of Inq Last 6 Months excl 7days 
NetFractionRevolvingBurden  Net Fraction Revolving Burden 
NetFractionInstallBurden  Net Fraction Installment Burden 
NumRevolvingTradesWBalance  Number Revolving Trades with Balance 
NumInstallTradesWBalance  Number Installment Trades with Balance 
NumBank2NatlTradesWHighUtilization  Number Bank/Natl Trades w high utilization ratio 
PercentTradesWBalance  Percent Trades with Balance 
6.2 Mutual Information
We also measure the joint mutual information of each feature with the target variable. Mutual information quantifies the information obtained about one variable through another. It is commonly used for feature selection to determine which set of features is important for a particular task. We find that the top features according to the global weights learned by logistic regression correspond to the features with higher mutual information. In the context of a binary target $Y$ and a feature $X$, we measure joint mutual information as
$I(X; Y) = \sum_{y \in \{0, 1\}} \int p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx$
where $p(x)$ and $p(y)$ are marginal densities. If $X$ and $Y$ are unrelated, then by independence $p(x, y) = p(x)\,p(y)$, implying zero mutual information. The stronger the association between $X$ and $Y$, the larger the mutual information. Mutual information can be interpreted as the entropy in feature $X$ minus the entropy in $X \mid Y$. Our goal is to identify the features with the maximal joint mutual information with our target, which should correspond to the most important variables. We do so by selecting the most important feature by maximizing joint mutual information, $\arg\max_{j} I(X_j; Y)$, over each feature $X_j$. Here we estimate mutual information following the approach described in Ross (2014) [15]. We find that the features with high mutual information match the top feature importances ascertained via logistic regression weights.
6.3 Logistic Regression Classifier Training
We train a logistic regression binary classifier to predict credit risk on the FICO dataset. We run a grid search over the regularization parameter using both L1 and L2 norms. We withhold 33% of samples for validation and achieve a validation accuracy of 73% on the holdout set. The dataset has balanced classes.
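To make concrete how global feature importance is read off a logistic regression, the sketch below fits a small model and ranks features by absolute weight. It is a simplified stand-in, not the training setup described above: it uses plain batch gradient descent with a fixed L2 penalty instead of a grid search over L1 and L2, it assumes features are on comparable scales, and the data, hyperparameters, and function names are hypothetical.

```python
import math

def train_logistic_regression(X, y, l2=0.01, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on L2-regularized log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(d):
                grad_w[j] += error * row[j]
            grad_b += error
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def rank_features(weights):
    """Order feature indices from most to least important by absolute weight."""
    return sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)

# Toy data: feature 0 drives the label, feature 1 is (almost) uninformative.
X = [[x / 5.0, 1.0 if i % 2 == 0 else -1.0] for i, x in enumerate(range(-10, 11))]
y = [1 if row[0] > 0 else 0 for row in X]
weights, bias = train_logistic_regression(X, y)
ranking = rank_features(weights)
```

Comparing weights by magnitude is only meaningful when the features share a scale, which is why standardizing inputs before fitting is a common choice.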
6.4 Neural Network Model Architecture
We train a neural network binary classifier to predict credit risk on the FICO dataset. The model uses

2 hidden layers with 17 and 5 ReLU activation units

sigmoid activation for output probability

trained for 20 epochs with batch size of 100 samples
The model achieves 73% validation accuracy on the balanced FICO dataset (with 1/3 of samples held out for validation).
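The layer structure listed above can be written out as a forward pass. This sketch shows only the shape of the network: the weights are small random placeholders rather than trained parameters, training itself (20 epochs, batches of 100) is omitted, and the input width of 23 is an assumption based on the predictors listed in Section 6.1.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Forward pass: two ReLU hidden layers followed by a single sigmoid output."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = dense(x, w1, b1, relu)
    h2 = dense(h1, w2, b2, relu)
    return dense(h2, w3, b3, sigmoid)[0]

rng = random.Random(0)
n_features = 23  # assumption: the 23 predictors listed in Section 6.1
layers = [init_layer(n_features, 17, rng), init_layer(17, 5, rng), init_layer(5, 1, rng)]
probability = forward([0.0] * n_features, layers)
```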
6.5 Attribution Techniques Background
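LIME produces interpretable representations of complicated models by optimizing two metrics: interpretability $\Omega(g)$ and local fidelity $\mathcal{L}(f, g, \pi_x)$. The explanation is defined as
$\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)$  (2)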
The comprehensibility $\Omega(g)$ of an interpretable function $g$ (e.g. the number of nonzero coefficients of a linear model) is optimized alongside the fidelity, or faithfulness, of $g$ in approximating the true function $f$ in the local neighborhood $\pi_x$. In practice, an input point is perturbed by random sampling in a local neighborhood and a simpler linear model is fit to the newly constructed synthetic dataset. The method is model agnostic, which means it can be applied to neural networks or any other uninterpretable model. The weights of the resulting linear model can be used to interpret a particular model prediction. Here we use the implementation of LIME provided by the open-source LIME Python package.
Integrated Gradients [19] is founded on axioms for attributions. First, sensitivity states that for any two inputs that differ in only a single feature and have different predictions, the attribution of that feature should be nonzero. Second, implementation invariance states that two networks that have equal outputs for all possible inputs should also have the same attributions. The authors develop an approach that takes the path integral of the gradient of the model output $F$ along the straight-line path ($x' + \alpha (x - x')$ for $\alpha \in [0, 1]$) between a zero-information baseline $x'$ and the input $x$:
$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$
The path integral between the baseline and the true input can be approximated with Riemann sums. An attribution vector for a particular observation (local) is produced as a result. The authors found that 20 to 300 steps can sufficiently approximate the integral within 5%.
DeepLIFT [16] explains the difference in output between a ’reference’ or ’baseline’ input $x^0$ and the original input $x$ for a neuron output $t$: $\Delta t = t - t^0$, where $t^0$ is the output at the reference. For each input $x_i$, an attribution score $C_{\Delta x_i \Delta t}$ is calculated, and the scores should sum to the total change in the output $\Delta t$. DeepLIFT is best defined by this summation-to-delta property:
$\sum_{i=1}^{n} C_{\Delta x_i \Delta t} = \Delta t$
The authors use a function analogous to a partial derivative and use the chain rule to trace the attribution score from the output layer to the original input. This method also requires a reference vector.
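Returning to Integrated Gradients, the Riemann-sum approximation described in this section can be sketched as follows. This is an illustration only: `model` is a hypothetical differentiable scorer standing in for the trained network, gradients are taken by central finite differences rather than by automatic differentiation, and the midpoint rule with 50 steps is one of several reasonable discretizations.

```python
import math

def model(x):
    """Hypothetical differentiable scorer standing in for the trained network."""
    return 1.0 / (1.0 + math.exp(-(1.5 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1])))

def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
# Completeness check: attributions should sum to roughly f(x) - f(baseline).
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

The completeness check plays the same role for Integrated Gradients that the summation-to-delta property plays for DeepLIFT: in both cases the attributions account for the change in output relative to the reference, which is why the choice of reference matters so much.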
6.6 On Entropy Score in Experiment 2
In Experiment 2, we state that a lower uncertainty corresponds to a higher entropy. For a probability density function, a higher uncertainty is typically associated with a higher entropy. However, we normalize attributions and use them directly as a probability density function instead of fitting a distribution to the data. Because of this choice, a set of very similar attributions across a varied reference has high entropy. Consider the toy example of a feature attribution across three different references that produced three identical values. We would normalize this set of feature attributions to $(1/3, 1/3, 1/3)$ and treat this as the probability density. Since this density represents a uniform distribution, it would produce maximal entropy.
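As a worked check of the toy example, assume the natural logarithm and the normalized density $(1/3, 1/3, 1/3)$ implied by three identical attributions. The contrasting density $(1, 0, 0)$ is a hypothetical case added for comparison, using the convention $0 \log 0 = 0$.

```latex
\begin{align*}
H\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)
  &= -3 \cdot \tfrac{1}{3} \log \tfrac{1}{3} = \log 3 \approx 1.099, \\
H(1, 0, 0) &= -1 \cdot \log 1 = 0.
\end{align*}
```

The first value is the largest entropy attainable with three references, so attributions that do not move with the reference score highest, while an attribution that is nonzero under only one reference scores lowest.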