PSD2 Explainable AI Model for Credit Scoring

by   Neus Llop Torrent, et al.

The aim of this paper is to develop and test advanced analytical methods to improve the prediction accuracy of Credit Risk Models while preserving model interpretability. In particular, the project focuses on applying an explainable machine learning model to PSD2-related databases. The input data were obtained solely from synthetic account transactions generated from a pool of Italian commercial banks. Among all the models tested, CatBoost showed the highest performance. The algorithm implementation produces a GINI of 0.45 after tuning the hyper-parameters, combined with its built-in class-weight resampling method. The SHAP package is used to provide global and local interpretations of the model predictions, offering a human-comprehensible approach to understanding the decision-making algorithm. The 20 most important features are selected using the Shapley values to present a fully human-understandable model that reveals how the attributes of an individual are related to its model prediction.




I Introduction

Credit Scoring is defined as the set of decision models and their underlying techniques that help lenders evaluate consumer credit thomas2017credit. Hence, it is the use of statistical methods, normally adopted by banks and financial entities, to estimate the likelihood that a loan applicant will or will not default gup2005commercial. A credit score is usually developed starting from factors such as the payment history, type and time of the credit application, outstanding debt and length of credit barron2003value.

New online ordering facilities, innovative payment mechanisms, friendlier online platforms and the improvement of user experience in e-commerce triggered the expansion of credit card usage and account creation. According to the Global Payments Cards Data and Forecasts to 2024 by Retail Banking Research (RBR), cashless payments increased by 18% in 2018, with payment cards being the biggest contributors, accounting for 57%. Besides, while debit card payments grew 6%, check payments declined 7%. The rise of cashless methods is transforming the facilities and conditions to access credit. One interesting case is Russia, where the Faster Payments System introduced in 2019 allows instant fund transfers via mobile phone numbers and QR codes. The results of these regulatory changes, combined with new payment technologies such as contactless (which increased by 25% in 2018), were reported by RBR in Payment Cards Issuing and Acquiring Europe 2020 to be the key causes of the 9% increase in card acceptance in 2018. As a consequence, the demand for automated processes has drawn the attention of commercial banks, which are looking for faster and more accurate techniques to assess clients' loan eligibility so as not to fall behind in the technological revolution of online payments that followed PSD2 and open banking gomber2018fintech. Failing to modernize and implement the technology demanded by customers could have damaging consequences for the sector.

On the other hand, account transactions give an idea of the spending habits of citizens, a highly relevant macroeconomic factor for systematic risk KHANDANI. The huge amount of data also illustrates the wide range of decisions to which customers are exposed. Credit card practices can reveal not only current lifestyles but also expectations or preferences about a future way of life. Additionally, beyond reflecting the current reality of a customer group, which can be obtained from credit bureau data, transaction data can serve as a gateway to identify inclinations that explain the life customers aspire to. Consumption behaviour and life practices can be seen as an explanatory frame for client ambitions, threats and preferences, which is a more genuine representation than their current state cclifestyle. Therefore, considering clients' irrationality, the study goes beyond customer classification by trying to explain to what extent some variables can reveal a tendency of subjects not to identify with their social class, by not behaving according to their expected behaviour.

The most common scorecard methods implement the well-known Logistic Regression, which notably reduced the time needed to assess applicants. While logistic regression can identify the reasons behind the model choice, its major drawback is its inability to capture non-linear relationships among features wang2015large. On the other hand, Machine Learning models have lately shown increased predictive power for Credit Risk Modelling (CRM), although they do not provide reliable explanations for the scores they produce. That is a particularly delicate issue in CRM since it is a highly regulated field: the General Data Protection Regulation (GDPR), the "Ethical Guidelines for Trustworthy AI" Ethical and the report from the European Banking Authority testify to the care dedicated to such topics by the European Community AIandLaw.

The Credit Scoring ecosystem made a 180-degree shift when the Payment Services Directive 2 (PSD2) introduced a new regulatory framework for the Single Euro Payments Area. The main objectives of PSD2 are improving customer protection and security while encouraging innovation and competition among the players in the payments industry by ensuring that all have the same access to data. To do so, PSD2 grants third parties free access, through the banks' APIs, to their clients' account data. However, due to the unprecedented changes in legislation that this directive brought, there are still no clear policies detailing the technologies to use or the type of data banks are obliged to share. Serving its purpose of promoting entrepreneurship, PSD2 successfully accelerated the creation of Paytech start-ups in Europe from 2018 to 2019, with customers' cashless payment habits being one of the drivers of its success polasik2020impact.

The new open banking era has also awakened the interest of Big Techs, which remain in the spotlight as potential entrants to the financial services industry. While it is unclear when and how this may occur, Apple already launched its credit card in 2019, while Google may begin opening consumer bank accounts in 2021.


Currently, alternative credit scoring systems are treated as protected trade secrets, raising concerns about privacy and highlighting the lack of transparency in how data is collected and used CreditReporting. Processing these volumes of data in a reasonable amount of time requires advanced AI and Big Data techniques. Since the publication in April 2019 of the Ethical Guidelines for Trustworthy AI by the European Commission, not being able to explain to a customer why he or she is not eligible for a loan is no longer considered acceptable. Hence, to ensure AI transparency, explainability has to be secured, and the explainability of AI models has drawn much attention as a way of meeting the trustworthiness criteria. Therefore, before exploiting the potential of Machine Learning in the Credit Scoring field, it is mandatory to address the interpretability issue. How reasonable is it to make machine learning 'black box' models responsible for their judgments when there is no clear comprehension of the process followed to reach their decisions? For instance, discrimination by gender, location, origin or race may be inevitable without a clear understanding of the process followed by the algorithm CreditReporting.

To manage this trade-off, we propose applying state-of-the-art explainability tools on top of a well-performing black-box algorithm. By doing so, we wish to retain the increased predictive power of Machine Learning, while providing meaningful explanations to the applicants involved, as well as to the regulator. The final goal of the work is to identify customers who default within the first 12 months of their relationship with the bank.

This problem was addressed by Jing Zhou from the Renmin University of China, who proposed a method for creating credit scoring features focusing on the frequency, recency and monetary value of the account information, using ML models huang2018rfms. For historical transaction data, new time-series approaches based on Recurrent Neural Networks and LSTM are showing outstanding results, even though they are still in a research phase. On the other hand, more classical approaches such as Logistic Regression, Decision Trees and Gradient Boosting classifiers are also implemented for more complex data sets.

In this paper, we show how to create an explainable credit risk model based solely on synthetic account transactions and balances generated from a pool of Italian commercial banks. We discuss how we created the feature vectors from the raw data, how the model operates and the reasoning behind the forecasted decisions. Furthermore, we show how it is possible to interpret the model decisions, taking advantage of state-of-the-art explainability libraries, and we discuss the model performance and behaviour.

II Dataset Description

The data used come from a synthetic dataset generated from a subset of customers of a pool of Italian commercial banks. The data contain account, transaction and client information. The collection covered three major client types: customers, freelancers, and companies. According to the scope of the analysis, only customer and freelancer data were selected.

ii.1 Data sources

The data were collected in four primary tables, each holding different information. The analysis begins with a description of each table, followed by an explanation of how the tables are combined to establish the final dataset.

The first dataset covers information about the client. It contains 24,412 different clients and 8 variables. The features in this dataset are the client ID, calculation date, performance date, subject type and the client performance with its respective score. The calculation date is the day when the client asked for the credit; it should not be confused with the performance date, which is when the client is evaluated, one year after requesting the loan. The credit history of the clients has been used to assess their creditworthiness. When a client has been insolvent three times in a row, it is considered a bad payer and the Performance variable is set to 1. Otherwise, the Performance is 0, meaning it is a good customer. The Performance Score feature gives an idea of the model's prediction confidence. Two subsets of different sizes were created to evaluate creditworthiness, but only the results of the largest subset were considered. Finally, the Subject Type variable contains information regarding the type of client, which accounts for 15,486 individuals and 8,926 freelancers. All features except the small-subset performance and the performance score were selected to proceed with the analysis.

The second source comprises the account information. It contains 41,194 different accounts and 6 variables. The features in this dataset are the account ID, the account balance, the date when the account balance was checked, the type of account, the account currency and the day of its creation. Of these, the currency and the account creation day were removed as they were considered irrelevant for the analysis. The account type is replaced by the subject type, referring to the type of client associated with the account. The account balance date is the same for all the accounts. This date is used to propagate the account balance backwards, day by day, from this date to the beginning of the data collection.

The third dataset collects all the transaction records. It holds 7,785,027 different transactions from the 41,194 different accounts. The 4 variables in this dataset are the transaction ID, the ID of the account from which the transaction was performed, and the date and amount of the transaction.

Finally, the last table serves as a bridge that connects the clients with their accounts. It contains 43,292 rows, each describing a different client-account pair. Among them, 41,194 accounts were identified, mapped to 36,899 different clients.

Figure 1: Scheme describing the main characteristics of the different tables and how they are related. Words coloured in red define the key of the table. Boxes in black identify the features used to join the tables.

The aggregated dataset in Figure 1 contains the selected features from every table, merged using the transaction ID as the identification key. To relate the different tables, a unique identifier is selected for each table. The client-account table acts as a link to merge the account table and the client table using the account and client IDs. Ultimately, the transaction table is merged using the account IDs.

In the aggregated dataset, 28,108 accounts were identified across the 4,862,772 transactions of the merged table. Individuals and freelancers were treated as the same type of client (company accounts were removed). Then, accounts with no transactions in the 90 days before the calculation date were also removed.

ii.2 Data analysis

Consumer and freelancer accounts are used in this work. They are grouped together, which makes it difficult to determine to what extent an account was used solely for business purposes. The majority of accounts, 63.8%, belonged to customers, while the remaining 36.6% belonged to freelancers.

Of the 24,412 clients, 23,299 (95.44%) were rated good while only 1,113 (4.56%) were rated bad, as shown in Figure 2. Consequently, as the target variable has more observations in one class than in the other, the dataset can be considered unbalanced, which will influence the behaviour of the algorithm. An unbalanced dataset may lead the model to overfit the majority class, as it will tend to favour that class regardless of the input variables (this problem is addressed in the Unbalanced Dataset section).

Figure 2: Histogram representing the distribution of bad and good customers along the dataset.

The account balance distribution is slightly shifted toward positive balances, as shown in Figure 3. The maximum is € while the minimum account balance is € . Given the magnitude of these extreme outliers, a better interpretation is obtained by considering the central values.

Figure 3: Box plot representing the distribution of customers’ account balance. The top and bottom 10% outliers were removed from the graph. The data accounts for 80% of the total.

The account-client relation is not one-to-one. In fact, a significant number of accounts have more than one client, and it is also common to find clients with more than one account. 82.2% of the accounts had a single owner, 16.3% had two owners and only 1% had three owners. Looking at the distribution of accounts owned by the same client, about half of the clients (54.4%) owned one account, followed by 34.5% owning two and 11.1% owning three or more.

The distribution of the transactions is shifted toward the negative side, as shown in Figure 4. The maximum is € while the minimum transaction is € .

Figure 4: Histogram of the distribution of the transactions amount. The top and bottom outliers were removed from the graph. The data accounts for 90% of the total.

III Data Manipulation

iii.1 Feature vectors construction

We propose a learning algorithm capable of estimating the account's probability of default. In practice, this means it does not directly predict a good or bad customer but instead reports a likelihood. When the probability exceeds a threshold, set to 0.5 by default, the account is considered bad. Interestingly, this popular cutoff is not backed by any theoretical justification, as noted by Eric Rosenberg rosenberg1994quantitative. Further complicating the issue, these predictions are reported as point estimates (with the model's implied error).

First, transactions are used to get the account balance for each day. Second, the different KPIs will be created for a window of three months, considering the day the client asked for the credit as the end.
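The first step can be illustrated with a minimal sketch. It assumes that the single known balance is an end-of-day balance that already includes all transactions up to the reference date, and that amounts are signed (positive for inflows). The function name `daily_balances` and the data layout are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_balances(ref_date, ref_balance, transactions, start_date):
    """Return {day: end-of-day balance} for every day in [start_date, ref_date].

    transactions: iterable of (day, signed_amount) pairs (assumed layout).
    """
    # Net movement of the account on each day.
    per_day = defaultdict(float)
    for day, amount in transactions:
        per_day[day] += amount

    balances = {}
    day, balance = ref_date, ref_balance
    while day >= start_date:
        balances[day] = balance
        # Undo that day's movements to obtain the previous day's closing balance.
        balance -= per_day[day]
        day -= timedelta(days=1)
    return balances
```

Walking backwards from the reference date only requires one known balance per account, which matches the situation where the balance date is the same for all accounts.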

iii.2 Pre-processing and Feature Engineering

This part addresses the processing stage necessary to handle the irregular temporal distribution of transactions for each account. In particular, the number of events varies for each customer and is not constant over time. To control for this, we built three-month windows to establish a constant period. Variables are then created by varying the length and time location of the periods and computing aggregation operations on the transactions and account balance. As a result, 112 variables are obtained for a total of 27,368 accounts. After this process, all the accounts whose shortened periods are empty are removed from the dataset.
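As an illustration of this kind of window aggregation, the sketch below computes a handful of aggregates over a window ending on the calculation date. The specific aggregates, their names and the 90-day default are assumptions chosen for the example; they are not the 112 variables used in the paper.

```python
from datetime import date, timedelta


def window_features(transactions, balances, calc_date, days=90):
    """Aggregate transactions and balances over the `days` before calc_date.

    transactions: list of (day, signed_amount); balances: {day: balance}.
    Returns None when the window holds no transaction (account to be dropped).
    """
    start = calc_date - timedelta(days=days)
    amounts = [a for d, a in transactions if start < d <= calc_date]
    if not amounts:
        return None
    window_bal = [b for d, b in balances.items() if start < d <= calc_date]
    return {
        "n_transactions": len(amounts),
        "total_inflow": sum(a for a in amounts if a > 0),
        "total_outflow": sum(a for a in amounts if a < 0),
        "mean_balance": sum(window_bal) / len(window_bal) if window_bal else None,
        "min_balance": min(window_bal) if window_bal else None,
    }
```

Changing `days`, or shifting `calc_date` backwards, gives windows of different length and location, which is how a larger family of variables could be generated from the same function.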

iii.3 Feature Selection and Dimensionality Reduction

Specific features may not be useful for modelling. The criteria we adopted to find those features were: a single unique value, a high percentage of missing values and high correlation with another feature. No feature had a single unique value. For all variables, less than 0.7% of values were missing, where both NaNs and zeros were considered missing values. Hence, no variables were removed based on these criteria. 34 features were removed due to a correlation higher than 0.98. The collinearity between pairs of features was calculated using the Pearson correlation coefficient. For each pair, when the correlation was higher than the specified threshold, the second variable by order of appearance was removed. Of the 112 variables, 78 were kept.
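The collinearity rule can be sketched as follows. This is one possible reading of the procedure, in which every variable is compared with all the variables that appear before it. Using the absolute value of the correlation is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / sqrt(vx * vy)


def drop_collinear(columns, threshold=0.98):
    """columns: dict mapping feature name to values, in order of appearance.

    For each pair above the threshold, the second feature is removed.
    Returns the names of the features that are kept.
    """
    names = list(columns)
    dropped = set()
    for j, second in enumerate(names):
        for first in names[:j]:
            if abs(pearson(columns[first], columns[second])) > threshold:
                dropped.add(second)
                break
    return [name for name in names if name not in dropped]
```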

After this first feature reduction, different models were run to obtain a benchmark.

However, to have an interpretable model, a smaller set of features is needed. Hence, to proceed with the dimensionality reduction, the SHAP values are used to select the most important features lundberg2017unified. This step cannot be carried out until a model has been trained. To be consistent with our model selection, we chose CatBoost as the benchmark algorithm. In the explainability section of the paper, we cover how the variables are selected using SHAP. It is worth mentioning here that the feature importance is a measure of the average impact the variable has on the model output. Hence, by taking the mean of the absolute SHAP values of each variable, we can rank the variables by the influence they had on the prediction.

Based on the SHAP variable importance, we selected the 20 most important features for the model.
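The ranking step can be sketched as below. The sketch assumes that a matrix of SHAP values (one row per sample, one column per feature) has already been computed for the trained model, for example with the SHAP package; the function name and the toy numbers in the usage example are illustrative only.

```python
def rank_by_mean_abs_shap(shap_values, feature_names, top_k=20):
    """Rank features by the mean absolute SHAP value over all samples.

    shap_values: list of rows, each row holding one SHAP value per feature.
    Returns (names of the top_k features, {feature: importance}).
    """
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:top_k], importance


# Toy usage with three hypothetical features and two samples.
top, importance = rank_by_mean_abs_shap(
    [[0.1, -0.5, 0.0], [-0.3, 0.4, 0.05]], ["f0", "f1", "f2"], top_k=2
)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant.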

iii.3.1 Scaling

Deep learning algorithms generally work best with standardized data, that is, features with a mean centred at zero and a standard deviation equal to one. The unit variance prevents the model from favouring features whose variance is orders of magnitude higher. Thus, we apply a standardization scaling to the sample values fed to the neural network. For the remaining models, no normalization or standardization is performed, as these methods alter the relative distance between the feature vectors and may impact the forecast.
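A minimal sketch of such a standardization is given below, assuming the usual z-score form (subtract the mean, divide by the standard deviation). Estimating the statistics on the training set only and reusing them on the test set is a design choice of this example, not something stated in the text.

```python
from math import sqrt


def fit_standardizer(rows):
    """Column-wise mean and (population) standard deviation of a list of rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by 0.
    stds = [
        sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
        for col, m in zip(columns, means)
    ]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std to every value, column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```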


iii.3.2 Unbalanced Dataset

As discussed in Section II, the numbers of good and bad customers are asymmetrical, as the former are far more frequent. Therefore, it can be studied as an unbalanced dataset problem. As bad customers account for 4.56% of the data, it may seem reasonable to interpret it as an imbalanced dataset problem, seeing the bad customers as a rare event. However, the nature of the problem makes it intuitively more reasonable to treat it as an unbalanced problem because, although the default of an account is less frequent, it is still a widespread event. Before applying any resampling method, the data has to be split into training and testing sets, as the method is only applied to the training set.

Hence, we proposed and tested a collection of 4 different resampling methods: undersampling the majority class, oversampling the minority class, the Synthetic Minority Oversampling Technique (SMOTE) and proportional class weighting.

Randomly undersampling the majority class consists of taking all the samples from the minority class and drawing the same number of samples from the majority class. This method has the disadvantage that a significant quantity of data is never used to train the model. As a consequence, the model could underfit, since important information from the majority class may be lost more2016survey. The opposite approach is randomly oversampling the minority class. In this method, instead of discarding part of the samples, the data from the minority class is replicated until it reaches the same size as the dominant class. However, as the added minority samples are just copies, they add no new information and may lead to overfitting. To mitigate this, SMOTE instead creates new synthetic samples for the minority class using the nearest-neighbours algorithm chawla2002smote. Two improved variations of vanilla SMOTE are explored. Borderline-SMOTE focuses on the points that fall on the borderline between the classes: it takes mainly the minority-class points close to the border that were incorrectly classified and uses them for the synthetic generation. That results in a better understanding of the region of conflict, bringing more knowledge where the algorithm finds it difficult to differentiate the classes han2005borderline. A similar alternative is SVM-SMOTE, which follows the same idea but uses a Support Vector Machine instead of KNN to identify the points that fall on the frontier.
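The two random resampling strategies can be sketched as follows. The function names and the fixed seed are assumptions made for the example, and the functions are meant to be applied to the training set only. SMOTE and its variants are not covered by this sketch.

```python
import random


def _indices_by_class(y):
    """Group sample indices by class label."""
    groups = {}
    for i, label in enumerate(y):
        groups.setdefault(label, []).append(i)
    return groups


def random_undersample(X, y, seed=0):
    """Keep as many samples of every class as the smallest class has."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_min = min(len(idx) for idx in groups.values())
    chosen = sorted(i for idx in groups.values() for i in rng.sample(idx, n_min))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def random_oversample(X, y, seed=0):
    """Replicate samples (with replacement) until every class matches the largest."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    n_max = max(len(idx) for idx in groups.values())
    chosen = []
    for idx in groups.values():
        chosen.extend(idx)  # keep every original sample
        chosen.extend(rng.choices(idx, k=n_max - len(idx)))  # add copies
    return [X[i] for i in chosen], [y[i] for i in chosen]
```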


Finally, class weighting changes the weight of each class in the objective function. We use class weights by assigning a weight of 1 to class 0 and a higher weight to class 1. For the CatBoost algorithm we also use the automatic class weights in SqrtBalanced mode (2) for the values that multiply the objective function prokhorenkova2018catboost.
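For illustration, the sketch below computes automatic class weights following the definitions of the `Balanced` and `SqrtBalanced` modes as described in the CatBoost documentation for `auto_class_weights`, assuming unit sample weights. The function itself is a hypothetical helper, not part of the CatBoost API.

```python
from collections import Counter
from math import sqrt


def auto_class_weights(y, mode="Balanced"):
    """Class weights derived from class frequencies.

    Balanced:     weight_k = (size of the largest class) / (size of class k)
    SqrtBalanced: weight_k = sqrt of the Balanced weight
    """
    counts = Counter(y)
    largest = max(counts.values())
    if mode == "Balanced":
        return {label: largest / n for label, n in counts.items()}
    if mode == "SqrtBalanced":
        return {label: sqrt(largest / n) for label, n in counts.items()}
    raise ValueError(f"unknown mode: {mode}")
```

With 96 good and 4 bad customers, for instance, the `Balanced` weight of the minority class is 24 while the `SqrtBalanced` weight is about 4.9, so the square-root mode corrects the imbalance less aggressively.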


All the methods will be combined with all the models to find the best performance.

The second problem that arises when dealing with unbalanced datasets is the choice of metrics used to evaluate model performance. While most algorithms maximize accuracy, this metric cannot be used on unbalanced datasets, as it would reward a dummy classifier that only predicts the majority class. Instead, the interest is not only in making the largest number of correct classifications overall but in having a good proportion of well-classified samples for both the majority and the minority class. Hence, the selection of the metrics is key to evaluating the results. This topic is covered in detail in Section V.

IV Model

Prediction models seek to find a relation linking the target variable with the independent ones. Once we obtain an estimate of such dependence, we may use it to predict the value of the target variable for new individuals.

Credit Risk models consider the borrower’s default as target variable (1 if the default occurred, 0 otherwise). Generally, the models try to predict the probability of default (PD), which can assume any continuous value from 0 to 1.

This section provides a theoretical background for the models employed in this paper.

For the objective of this analysis, binary classification is preferable. For this reason, a threshold between 0 and 1 should be defined, generally 0.5, which will mark the division of the two classes.

We divide the dataset, using part of it for training and the remainder for testing. To obtain more reliable results, the different techniques and resampling methods are compared using 5-fold cross-validation.

The algorithms have been implemented using the Scikit-learn library and the Tensorflow-Keras machine learning software frameworks.

iv.1 Logistic Regression

One of the most common techniques in Credit Risk is Logistic Regression. We provide a brief overview of the model, while the interested reader may refer to visani2019explanations for an in-depth analysis of Logistic Regression applied to Credit Scoring data.

The key idea is to model the conditional mean $E[Y \mid X]$, from now on $\pi(X)$, with respect to the independent variables $X$.

To do so, we shall consider some constraints on $\pi(X)$, namely that it is bounded in the interval $[0, 1]$. Therefore, we employ the transformation $\log\frac{\pi(X)}{1-\pi(X)}$, which takes values in $(-\infty, +\infty)$. It is now legitimate to assume a linear relationship between the transformed quantity and the independent variables $X$, namely $\log\frac{\pi(X)}{1-\pi(X)} = X\beta$. The formula may be rewritten with respect to $\pi(X)$, which highlights the non-linear relationship with the variables, namely $\pi(X) = \frac{1}{1 + e^{-X\beta}}$.

The vector $\beta$ contains the intercept and the coefficients for each variable, which represent the slope of the non-linear relationship. The best values for the parameters are usually found with the Newton-Raphson optimization framework, which finds the parameters achieving the maximum log-likelihood for our dataset greene2008econometric.
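To make the optimization step concrete, the sketch below fits a logistic regression with an intercept and a single variable by Newton-Raphson iterations on the log-likelihood. Restricting the model to one variable, so that the 2x2 system can be solved by hand, and the fixed number of iterations are simplifying assumptions of this example.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic_newton(x, y, n_iter=25):
    """Maximum-likelihood fit of P(y=1|x) = sigmoid(b0 + b1 * x).

    Each iteration applies beta <- beta + (X^T W X)^{-1} X^T (y - p),
    with W = diag(p * (1 - p)). Assumes the classes are not perfectly
    separable, otherwise the estimates diverge.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of X^T W X (the negative Hessian).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 linear system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy usage: one variable, overlapping classes.
b0, b1 = fit_logistic_newton([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
```

At the optimum the gradient vanishes, so the predicted probabilities sum to the number of observed defaults.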

While Logistic Regression does a good job of associating a probability with a prediction and does not assume that the probability is a linear combination of the independent variables, it has the limitation of not being able to capture non-linear trends, as it assumes linearity between the independent variables and the log odds bolton2010logistic. As a consequence, we present a range of ML models intended to overcome this major drawback.

On a completely different line, the following models attempt to recover the relation between the target and the independent variables without assuming it to belong to a class of parametric functions. In this sense, they are non-parametric and generically called Machine Learning models.

iv.2 Random Forest

The Random Forest exploits Decision Tree models, very simple non-linear models which recursively cut the geometrical space of the independent variables with the aim of clustering together regions with the same value of the target variable.
Random Forest aggregates many simple decision trees, using the Bagging procedure, to increase their prediction power. In addition, the trees are generated more arbitrarily, by randomly choosing the split variable at each node. This procedure increases the diversity among trees and consequently improves the performance of the ensemble model.
The model is capable of representing highly non-linear functions and usually achieves good predictions. Another strong point of Random Forest is its robustness to overfitting: thanks to the bagging procedure, the model does not suffer a decrease in accuracy when expanding the number of trees.

To improve the performance of the model, an exploratory analysis of the hyperparameters is carried out. Some of the important parameters to be tested are the number of estimators and their depth. Hence, it is crucial to find the right number of decision trees and leaves to prevent overfitting.

iv.3 Gradient Boosting

Similarly, Gradient Boosting also employs Decision Trees. Unlike Random Forest, the model uses the boosting procedure as its aggregation technique: small trees are sequentially added to the model to reduce the loss, while keeping the previous trees fixed. Each tree focuses more on the individuals which have been badly predicted by the previous trees. This training phase is guided by the gradient of the errors of the preceding trees, hence the name Gradient Boosting. For binary classification problems, Gradient Boosting uses the Cross-Entropy Loss as loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$

where $y_i$ is the observed target and $p_i$ the predicted probability of default for individual $i$.
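As a small numerical companion to the binary cross-entropy loss, the sketch below evaluates the loss and the per-sample gradient with respect to the raw (pre-sigmoid) score, which is the quantity each new tree is fitted to reduce. Clipping the probabilities with `eps` and the function names are assumptions of this example.

```python
from math import exp, log


def cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return -total / len(y_true)


def loss_gradient(y_true, raw_scores):
    """Per-sample derivative of the loss w.r.t. the raw score f, with p = sigmoid(f).

    For the cross-entropy loss this derivative is simply p - y.
    """
    return [1.0 / (1.0 + exp(-f)) - y for y, f in zip(y_true, raw_scores)]
```

The gradient is largest in magnitude for the individuals whose predicted probability is furthest from their label, which is why subsequent trees concentrate on badly predicted cases.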


The technique usually achieves very good results, although it is prone to overfitting. To control it, it is good practice to use early stopping on the number of trees during the training phase. Grid Search is suggested to tune the other hyperparameters.


For a more efficient implementation of the algorithm, the Light Gradient Boosting model will also be tested ke2017lightgbm.

iv.4 CatBoost

CatBoost is a Gradient Boosting implementation developed by the Russian tech company Yandex. Different motivations led to the creation of the algorithm, for instance the treatment of data coming from various sources and the need to handle categorical variables. Moreover, it is robust to parameter changes and reduces the costly task of tuning, showing great results from the first runs.


CatBoost uses oblivious decision trees kohavi1994bottom. Compared to classic Decision Trees, the oblivious version imposes that nodes at the same height in the tree use the same variable for the split. The modification is intended to prevent overfitting, making the model more stable to parameter changes kohavi1995oblivious. Oblivious trees are easily parallelizable, which allows training on GPUs and reduces the time needed to obtain a properly tuned model.
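Because every level of an oblivious tree shares one split, a prediction reduces to computing a binary index, as the following sketch shows. The representation of the tree as a list of (feature index, threshold) pairs plus a table of leaf values, as well as the bit ordering, are illustrative assumptions and not CatBoost's internal format.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree of depth len(splits) on one sample.

    splits: one (feature_index, threshold) pair per level, shared by all
            nodes at that level.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    """
    index = 0
    for feature, threshold in splits:
        # Each level contributes one bit: 1 if the sample goes right.
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return leaf_values[index]


# Depth-2 toy tree: level 0 tests feature 0, level 1 tests feature 1.
value = oblivious_tree_predict([6.0, -1.0], [(0, 5.0), (1, 0.0)], [0.1, 0.2, 0.3, 0.4])
```

The absence of branching on the tree structure itself is what makes this evaluation easy to vectorize and parallelize.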

CatBoost uses category-based statistics to handle categorical values. It considers that the encoding of the categorical features is better performed by the algorithm itself, rather than by humans. To do so, it creates numerical features from the categorical ones, by using the category’s number of occurrences in the dataset.

The difference between the classical Gradient Boosting and the CatBoost algorithm is that, in the former, the leaf values are calculated by averaging the gradients in the current leaf, which means it is an estimate of the gradient for all possible individuals in the leaf. This intrinsically introduces a bias, due to evaluating the model predictions on the same individuals used for training. To overcome the overfitting problem, CatBoost computes the gradient for each individual separately prokhorenkova2018catboost. The gradient is then based only on the individuals that precede the one being assessed. In practice, it trains a logarithmic number of models, which are trained at the same time. Consequently, this approach works well with small datasets, as it is computationally expensive dorogush2018catboost.

iv.5 Neural Network

Neural Networks are processing algorithms whose architecture is inspired by the biological structure of the brain. They aim to mimic the function of the human brain by feeding information through different layers and nodes. The simplest form is called multilayer perceptron (MLP) and represents a feed-forward network, namely one in which the information flows in a single direction and only once through the network. Its basic structure consists of an input layer followed by an arbitrary number of hidden layers and a final output layer that outputs the predicted value of the dependent variable. Each layer contains several nodes, which are responsible for computing a weighted sum of the input information received from the nodes of the preceding layer. This result is passed to a non-linear activation function. The process is repeated for all the layers, up to the output layer.
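The forward pass described above can be sketched in a few lines. The choice of ReLU for the hidden layer and a sigmoid for the output, as well as the toy weights, are assumptions made for the illustration and do not describe the architecture used in the paper.

```python
from math import exp


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + exp(-v)) for v in values]


def dense(inputs, weights, biases):
    """Weighted sum of the inputs for every node of a layer."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Propagate one sample through a list of (weights, biases, activation)."""
    out = x
    for weights, biases, activation in layers:
        out = activation(dense(out, weights, biases))
    return out


# Toy network: 2 inputs -> 2 hidden nodes (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [-1.0], sigmoid),
]
probability = mlp_forward([2.0, 1.0], layers)[0]
```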

More advanced networks change the propagation scheme through the network, breaking the feed-forward mechanism and allowing for more complex interactions among the nodes.

Back-propagation is the most widely used training method for neural networks. It consists of an efficient propagation of the gradient-based errors backwards through the nodes, which allows for optimization of the network's weights. Different types of gradient descent can be used; in our implementation, we employ Stochastic Gradient Descent (SGD).
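To make the update rule concrete, the following sketch applies stochastic gradient descent to a single sigmoid unit trained with the log-loss. In a multi-layer network, back-propagation supplies the analogous gradient for every weight. The learning rate, number of epochs and toy data are assumptions made for illustration, not the settings of our implementation.

```python
import math
import random

def sgd_logistic(samples, labels, lr=0.1, epochs=200, seed=0):
    """Train one sigmoid unit, updating the weights after every single sample."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    indices = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in random order
        for i in indices:
            z = sum(w * x for w, x in zip(weights, samples[i])) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - labels[i]  # gradient of the log-loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, samples[i])]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature dataset: larger values belong to class 1.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = sgd_logistic(samples, labels)
```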

V Results

To perform the evaluation of the different models we propose the AUC-ROC curve and the GINI index. We present a comparison of the four methods previously described.

v.1 Evaluation Metrics

To quantify the number of correct classifications the model makes for each class, counting the overall number is not enough. For instance, if the accuracy is high but all the correctly classified samples come from the majority class, the model is useless and does not provide any relevant output for the task. Therefore, we will consider different metrics that recognise the relevance of the model predictions.

The confusion matrix aggregates all the classification information in a table. Rows represent the true values and columns the predicted ones. Each element in the table denotes a different outcome. True positives and true negatives are the individuals correctly classified by the model as bad and good payers respectively, following the convention of the local interpretation subsection, where the positive class denotes a bad customer. Conversely, false positives and false negatives are good and bad customers, respectively, who were incorrectly classified.
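A minimal sketch of how the four cells are counted is given below. It assumes that label 1 (the positive class) marks a bad payer, as in the local interpretation subsection; the labels and predictions are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcomes, with 1 = bad payer (positive) and 0 = good payer."""
    cells = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            cells["TP"] += 1
        elif truth == 1 and pred == 0:
            cells["FN"] += 1
        elif truth == 0 and pred == 1:
            cells["FP"] += 1
        else:
            cells["TN"] += 1
    return cells

# Hypothetical true labels and model decisions.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
cells = confusion_matrix(y_true, y_pred)
print(cells)  # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 3}
```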

However, the confusion matrix depends on the chosen classification threshold and does not summarise the performance of the model in a single figure.

Consequently, the metrics proposed are the ROC-AUC and the Gini index. The ROC is a curve that plots the true positive rate against the false positive rate. It shows the performance of the model over all possible thresholds, making the final result invariant to the threshold choice. To measure how well the model separates the classes, the AUC (Area Under the Curve) aggregates the response of the model over all thresholds from 0 to 1; it can be read as the probability that the model ranks a randomly chosen positive individual above a randomly chosen negative one.

The Gini index (4) is derived from the AUC and is a standard metric in risk assessment. Like the AUC, it measures how well the model separates the two classes, but it is rescaled so that a random classifier scores 0 and a perfect one scores 1, making it more intuitive.

$$ \mathrm{GINI} = 2 \cdot \mathrm{AUC} - 1 \qquad (4) $$
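As a small worked example, the sketch below computes the AUC through its ranking interpretation (the fraction of positive/negative pairs that the model orders correctly, with ties counted as one half) and then applies the standard relation GINI = 2 · AUC − 1. The labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    return 2.0 * roc_auc(y_true, scores) - 1.0

# Hypothetical labels (1 = positive class) and predicted scores.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
auc_value = roc_auc(y_true, scores)    # 5 of the 6 pairs are ordered correctly
gini_value = gini(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for an illustration; on large datasets a sort-based computation would be preferable.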


v.2 Experiments

The first experiment runs the 4 models. The classical Logistic Regression is compared with the Gradient Boosting, CatBoost and Neural Network models. Figure 5 shows that Gradient Boosting is the best in class, with CatBoost and Neural Networks following closely, while Logistic Regression has the worst performance in terms of AUC.

Figure 5: The illustration shows the AUC-ROC curve with the true positive against the false positive rate for the different models. LR: Logistic Regression; GBT: Gradient Boosting; CBT: CatBoost; NN: Neural Networks
| Classifiers | Resampling | GINI (std) | 20F GINI (std) |
|---|---|---|---|
| Logistic Regression | BorderlineSMOTE | 0.3365 (0.0482) | 0.3556 (0.0543) |
| | SVMSMOTE | 0.3386 (0.0516) | 0.3391 (0.0517) |
| Random Forest | BorderlineSMOTE | 0.3612 (0.0280) | 0.3672 (0.0295) |
| | SVMSMOTE | 0.3800 (0.0355) | 0.3674 (0.0321) |
| | Oversampling Minority | 0.4100 (0.0058) | 0.4017 (0.0067) |
| | Undersampling Majority | 0.4022 (0.0720) | 0.3988 (0.0052) |
| Gradient Boosting | BorderlineSMOTE | 0.4193 (0.0390) | 0.3985 (0.0339) |
| | SVMSMOTE | 0.4254 (0.0400) | 0.4082 (0.0383) |
| | Oversampling Minority | 0.4174 (0.0007) | 0.4200 (0.0008) |
| | Undersampling Majority | 0.4178 (0.0007) | 0.4204 (0.0011) |
| CatBoost | Class-weight | 0.4543 (0.0369) | 0.4577 (0.0382) |
| | Auto-class-weight | 0.4417 (0.0431) | 0.4431 (0.0375) |
| | BorderlineSMOTE | 0.4377 (0.0328) | 0.3880 (0.0450) |
| | SVMSMOTE | 0.4399 (0.0360) | 0.4021 (0.0419) |
| | Oversampling Minority | 0.4152 () | 0.4147 () |
| | Undersampling Majority | 0.4152 () | 0.4147 () |
| Neural Networks | Class weight | 0.4082 (0.0190) | 0.4108 (0.0239) |
Table 1: Comparison of the different resampling and classification methods with their corresponding Gini index and error. The third column refers to the models trained on the dataset after the pre-processing step, with a total of 78 features. The fourth column shows the GINI for the models trained on the 20 most important features, obtained through the Shapley values built on the CatBoost model with all the features. The error, estimated as the standard deviation of the results of the 5-fold cross-validation, is given in parentheses.
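Each cell of Table 1 has the form "mean (std)" over the cross-validation folds. A minimal sketch of that aggregation is shown below; the fold values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the estimator is not specified here.

```python
import statistics

# Hypothetical GINI values from 5 cross-validation folds (not taken from Table 1).
fold_gini = [0.41, 0.46, 0.44, 0.49, 0.47]

mean_gini = statistics.mean(fold_gini)
std_gini = statistics.stdev(fold_gini)  # sample standard deviation (assumption)

# Formatted like a table cell: mean followed by the error in parentheses.
cell = f"{mean_gini:.4f} ({std_gini:.4f})"
```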

The third column in Table 1 reveals that the highest GINI was obtained by combining CatBoost with the Class-weight resampling method, followed by Auto-class-weight. With CatBoost, only the undersampling and oversampling techniques underperformed the other scenarios, although the low error that accompanies these techniques is worth noting. For the four classifiers trained with both methods, the SVMSMOTE resampling method obtains better performance than its counterpart BorderlineSMOTE in all cases.
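For readers unfamiliar with class weighting, the sketch below shows one common way of deriving weights that are inversely proportional to the class frequencies (the formula behind scikit-learn's `class_weight="balanced"` option). Whether the weights used in our experiments follow exactly this formula is an assumption; the labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)): rare classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

# Hypothetical unbalanced target: 9 good payers (0) and 1 bad payer (1).
labels = [0] * 9 + [1]
class_weights = balanced_class_weights(labels)
```

Unlike SMOTE-style methods, this approach does not create or discard samples; it only changes how much each sample counts in the loss.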

It is also interesting to notice that the overall response of Deep Neural Networks outperforms that of Logistic Regression and Random Forest, although the boosted methods prove to be better at the prediction task.

Figure 6: The illustration shows for the 20 most important features the AUC-ROC curve with the true-positive rate against the false-positive rate for the different models. LR: Logistic Regression; GBT: Gradient Boosting; CBT: CatBoost; NN: Neural Networks.

In line with the explainable AI model pursued in the paper, a second experiment computes the results using only the 20 features selected through the feature importance calculated with the SHAP values built on the CatBoost model.

Figure 6 presents the results after the dimensionality reduction. Compared with the initial approach, the majority of the models surprisingly maintained or only insignificantly reduced their accuracy, while gaining benefits in interpretability and training time.

Looking at the fourth column of Table 1, the CatBoost method again outperforms the others, followed by Gradient Boosting. While the boosted methods obtained the best results, Neural Networks came third in the classification, leaving Random Forest and Logistic Regression behind. Notably, this column revealed the importance of choosing the right resampling method for each model. For instance, combining CatBoost with Class-weight leads to the best GINI of 0.4577, whereas if BorderlineSMOTE is used instead, the GINI is 0.3880, which falls behind Gradient Boosting and Neural Networks.

With a few exceptions, our results indicate that decreasing the number of features did not substantially affect the models' performance. The most striking outcome is that, while most of the models lost some performance after the reduction, CatBoost maintained almost the same predictive performance when using Auto-weight resampling. Unexpectedly, for Neural Networks reducing the number of features led to higher model performance.

VI Explainability

Black box machine learning models have the major disadvantage of not being able to explain the rationale behind a prediction. The core intention in machine learning is about providing the best possible forecast, but in many real-life scenarios, there is also a need for useful information about the decision lipton2018mythos.

Explainability can be defined as giving human-understandable motivations of how given attributes of an individual are related to its model prediction 10.1145/3236009. While interpretability stands for providing some meaning to the human in a way it can be understood, explainability goes a step further by finding a human-comprehensive way to understand the decision-maker algorithm arrieta2020explainable. Explanations can be either local or global. While the former explain the reasons for a specific decision on a single individual, the latter focus on providing some meaning for the whole model's logic to grasp the grand scheme of the algorithm 8466590.

In the Credit Scoring field, this topic is relatively new and particularly important, given the strict regulation on the matter, especially in the European community, such as the European General Data Protection Regulation (GDPR) and the Ethics guidelines for trustworthy AI Ethical. In addition to the regulatory issues, banks and financial institutions place a high value on understanding the model reasoning, since it provides data-driven insights that human operators can comprehend.

The explainability topic in Credit Scoring has already been tackled in previous works, using in particular the LIME framework ribeiro2016should and its new extensions guaranteeing more stability for the explanations visani2020statistical visani2020optilime. In this contribution, we propose a different line exploiting the SHAP algorithm: a tool that assigns an importance level to each feature in the model, namely how much the variable contributes to the prediction.

SHAP (SHapley Additive exPlanations) NIPS2017_7062 is based on game theory, in particular on the Shapley values shapley1953value. The original framework was developed to redistribute the gain of a cooperative game among the players in a fair way. The same idea is borrowed in SHAP, where the aim is to decompose the prediction of the ML model among the features involved.


$$ f(x) = \phi_0 + \sum_{i=1}^{|F|} \phi_i(x) \qquad (5) $$

In Equation (5), $\phi_0$ represents the baseline while $\phi_i(x)$ is the specific contribution of the feature $i$ to produce the ML prediction $f(x)$ for the single individual whose feature vector is represented by $x$.
Regarding the calculation, SHAP exploits the Shapley algorithm:

$$ \phi_i(x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right] $$

The idea is to consider an ML model with a restricted set of variables $S$ out of $F$ (complete set), not containing the $i$-th feature. We evaluate the difference in prediction between the model using only the variables $S$, $f_S(x_S)$, and the same model adding also the $i$-th feature, $f_{S \cup \{i\}}(x_{S \cup \{i\}})$. The difference is attributed to the $i$-th variable.
Ideally we should consider every possible set of features $S$ and average each difference in prediction caused by the $i$-th variable. This is not feasible from a computational point of view, therefore SHAP performs a random sampling of the possible sets $S$ and computes the average of their prediction differences. Hence, we obtain an estimate of the $i$-th variable importance.
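The exact Shapley computation and a sampled approximation can be compared on a toy example. The sketch below is not the SHAP library's implementation: the approximation uses the standard Monte Carlo estimator over random feature orderings, and the value function `toy_value` (a hypothetical model with three features, where absent features are replaced by a baseline of 0) is an assumption made for illustration.

```python
import random
from itertools import combinations
from math import factorial

def shapley_exact(value, n_features):
    """Weighted average of the marginal contributions over every subset S."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def shapley_sampled(value, n_features, n_permutations=2000, seed=0):
    """Monte Carlo estimate: average marginal contribution over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        included = frozenset()
        previous = value(included)
        for i in order:
            included = included | {i}
            current = value(included)
            phi[i] += current - previous
            previous = current
    return [p / n_permutations for p in phi]

def toy_value(subset):
    """Hypothetical model f(x) = 2*x0 + x1*x2 at x = (1, 2, 3); absent features are 0."""
    x = [v if j in subset else 0.0 for j, v in enumerate((1.0, 2.0, 3.0))]
    return 2.0 * x[0] + x[1] * x[2]

exact = shapley_exact(toy_value, 3)      # the interaction x1*x2 is split equally
approx = shapley_sampled(toy_value, 3)
baseline = toy_value(frozenset())
```

In this example the contributions add up to the difference between the prediction and the baseline, which is the additive property that a decomposition of the form baseline plus per-feature contributions relies on.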

This procedure works for any kind of ML model, at the cost of an elevated computation time.
A recent improvement of the SHAP technique is the TreeSHAP algorithm lundberg2018consistent, which provides exact calculation of the feature importances for tree-based models. The key intuition is to exploit the tree structure to calculate the importance over all the possible feature sets $S$ in just a single pass. Along with the advantage of exact computation, the running time is also drastically reduced.

SHAP is a local technique, since it decomposes a single prediction at a time. However, there are ways to generalize the individual feature importances to the entire dataset, namely by calculating the average absolute importance of each variable over all the individuals. In doing so, we obtain a global feature importance, which quantifies the relevance of each variable for the ML model.
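A minimal sketch of this aggregation follows: the local SHAP values of each feature are averaged in absolute value over all individuals, and the features are then ranked. The feature names are taken from Appendix A, but the SHAP values are invented for the example, and the helper names are hypothetical.

```python
def global_importance(shap_rows, feature_names):
    """Mean absolute SHAP value of each feature over all individuals."""
    n_rows = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }

def top_k(importance, k):
    """Names of the k features with the largest global importance."""
    return sorted(importance, key=importance.get, reverse=True)[:k]

# One row of local SHAP values per individual (hypothetical numbers).
feature_names = ["min_AB_30f", "mean_T_90", "AB_CalDay"]
shap_rows = [
    [0.4, -0.1, 0.1],
    [-0.6, 0.3, -0.1],
    [0.2, -0.2, 0.1],
]
importance = global_importance(shap_rows, feature_names)
selected = top_k(importance, 2)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant.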

In this paper, the SHAP technique is used to address two different research questions. Firstly, we use the SHAP global feature importance to rank the dataset variables, in order to perform feature reduction and keep only the 20 most relevant ones. Secondly, SHAP on single individuals provides insights into the rationale of the final ML model.

Based on the comparison of the different models, CatBoost was selected for the interpretable machine learning approach with TreeSHAP. The model is fitted on the training data, and the SHAP values that establish the additive feature contributions are then computed on the test set.

vi.1 Global Interpretation

SHAP provides a set of useful plots for local and global interpretations of the feature’s contribution to the model.

Figure 7: The illustration shows the importance of each feature to the output magnitude.

The summary plot in Figure 7 ranks the features by their mean absolute impact on the model output. Of the 20 most important features, the minimum account balance (AB) reached in the months before the client asks for the loan carries the most information. Closely following, the mean number of transactions in a window of 90 days and the account balance on the day the client asked for the loan come in second and third place respectively.

Figure 8: SHAP values for each feature (each dot represents the value for a single individual)

Figure 8 summarizes the SHAP values of every feature for the whole dataset. The dots shown in each variable's row represent the individual subjects in the dataset. On the vertical axis the features are ordered by their mean absolute SHAP value. The horizontal axis collects the SHAP values, reflecting the positive or negative impact of each feature on the model prediction. A positive SHAP value increases the probability of the sample being considered a bad customer, while a negative value decreases it. Furthermore, a longer tail reflects a higher impact on the prediction. The colormap bar moves from blue to red as the feature value increases. High feature values on the positive side of the axis have a positive correlation with the dependent variable. Hence, those values have a positive effect on the prediction, promoting a bad account. For instance, in the feature (appendix), the higher the value of its samples, the higher the SHAP values and therefore the higher the probability of a bad individual. On the contrary, negative correlations push the instances to be considered as good customers. The and variables in the ranking ( and ) are an example of this negative influence. The higher the values of these variables, the stronger the push towards considering an account a good one.

Dependence plots help to understand how SHAP values are allocated to a particular account given the general behaviour. In Figure 9 we show the relation between the AB values and their Shapley values. As expected, the model assigns negative SHAP values, pushing the account towards a good customer, when the AB on the calculation day is higher. The model gives more importance to the accounts close to zero, mostly to accounts with a small negative balance on that day. Surprisingly, for extremely negative AB the model changes its behaviour and identifies the account as more likely to be good than bad. However, looking at max_AB_90 (maximum account balance over the last 3 months), for an account with relatively high values (pink and purple samples), the SHAP values are negative or close to zero and predict a good customer, even if AB_CalDay (AB on the Calculation Day) is slightly negative. This behaviour can be seen in the range between where the blue points with small maximum balances have a higher impact on the decision, while pink points, referring to higher maximum values, have zero or negative impact on the decision.

Figure 9: Scatter plot of the Shap values of the Account Balance on the Calculation Day against the Account Balance values coloured by the maximum value of the Account Balance in a 90 days window.

The selection of the 20 most important features can bring some knowledge about the characteristics the model found most useful for separating the accounts. Except for the account balance on the calculation day, all the other variables were created for all windows. Overall, the 30 days window before the Calculation Day has been the most important one, occurring 6 times out of 19, followed by the 90 days window with 5/19. The 30 days windows of the second and third month both appear 4 times, leaving the 60 days windows with 0 variables selected. Therefore, while averaging over the three months is useful, splitting the data into shorter periods brings even more relevant information, especially the closer the period gets to the day the client asked for the loan. A second remark refers to the type of feature. We can distinguish 3 different categories: KPIs referring to the number of transactions, the transaction balance or the account balance for a given day. Of the 20 variables, 14 were associated with the number of transactions, 7 with the account balance and only one with the transaction balance. Regarding the type of operation performed to obtain the feature (min, max, mean), the most significant features were related to operations looking for minimums, followed by the mean. Finally, looking at features based on positive or negative values, no positive averages were selected, and only 3 negative ones. Hence, we can interpret that the negative and minimum values were considerably more relevant for the model to extract meaningful information.

vi.2 Local Interpretation

In this subsection, the four different types of possible accounts of the confusion matrix (TP, TN, FP, FN) are discussed.

vi.2.1 True Positive

The True Positive is an account of a bad customer that was correctly identified by the model as being bad. A random account from the test dataset, with a predicted probability of 0.57 of being bad, has been chosen to exemplify the behaviour of the model. Figure 10 shows that the feature with the highest impact for predicting a bad customer was min_sumT_day_30c (appendix). In the second position, AB_CalDay reflects the importance of the account's balance on the day the client asked for the loan. On the other hand, given that mean_T_30c (mean of the number of transactions per day in the last month) is just one, this feature pushes the model to consider the account a good one. However, this influence is not enough: the features pushing in the opposite direction have more weight and lead to the correct prediction.
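The logic of a waterfall plot can be sketched as a running sum: starting from the baseline, each feature contribution is added in turn until the model output for the individual is reached. The baseline and contributions below are hypothetical, and the sketch does not assume a particular output space (for tree models the explainer output is often expressed in log-odds rather than in probabilities).

```python
def waterfall_steps(baseline, contributions):
    """Return (feature, contribution, running total), largest |contribution| first."""
    steps = []
    total = baseline
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for name, value in ranked:
        total += value
        steps.append((name, value, total))
    return steps

# Hypothetical local explanation of one account.
baseline = -1.0
contributions = {
    "AB_CalDay": 0.6,
    "min_sumT_day_30c": 0.9,
    "mean_T_30c": -0.3,  # pushes towards a good customer
}
steps = waterfall_steps(baseline, contributions)
final_output = steps[-1][2]
```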

Figure 10: True Positive. Waterfall plot of a bad account categorized as bad by the model.

To explain the importance of the first feature in Figure 10, we take advantage of the global perspective that a dependence plot offers. When features are strongly correlated, we have to be careful about the interpretation of the feature importance. Considering that the interaction with other features increases the feature's value, in Figure 11 we analysed how the SHAP and feature values relate when the interactions with other features are considered. A general look at the graph shows a peak at 0, meaning that the model gives importance to this feature when it finds values of min_sumT_day_30c equal or close to zero. Probably, this is an indicator of a not very active account, or one that doesn't move large quantities, marking it as less trustworthy. When this feature moves away from zero, it becomes less important for the model's predictions, with a clear exception for values close to -2.000 €. Interestingly, this feature becomes irrelevant when the values differ from the ones described. In these cases, the contribution to the prediction is usually negative, pushing the model towards a good customer.

Figure 11: Dependence plot of min_sumT_day_30c, the minimum value of the sum of transactions of the month prior to the Calculation Day.

vi.2.2 True Negative

The True Negative is an account of a good customer that was correctly identified by the model. The account in Figure 12 shows a minimum balance of € in the first month of the analysis (min_AB_30f feature), in the 90 days window (min_AB_90 feature) as well as on the calculation day (AB_CalDay feature). These minima, together with a maximum AB of € and a mean of one transaction per day, are indicators that the customer is likely to be good, with only a 0.05 probability of misbehaviour.

Figure 12: True Negative. Waterfall plot of a good account categorized as good by the model.

vi.2.3 False Positive

The False Positive is an account of a good customer that was incorrectly identified as bad. The account in Figure 13 is an interesting example of this type, as it was wrongly classified with a predicted probability of 0.807. Some signs that explain this misclassification could be that the account reached a minimum account balance of € ( feature), reported up to 18 negative transactions in the same day ( feature) and had a maximum value of the sum of transactions in a day of € ( feature). Therefore, the model's choice was consistent with the numbers, even though the customer behaved in an unpredicted way given its transaction trends.

Figure 13: False Positive. Waterfall plot of a good account categorized as bad by the model.

vi.2.4 False Negative

The False Negative is an account of a bad customer that was incorrectly identified as good. In Figure 14 this client type is exemplified with (appendix) being the feature with a major impact toward what would have been a correct guess.

Figure 14: False Negative. Waterfall plot of a bad account categorized as good by the model.

Nevertheless, the same feature with one month shift ( feature) has the opposite effect, despite showing a similar amount of € . That is a clear example of a model choice that is not interpretable from the model’s output. Hence, we reported the SHAP dependence plot of these 2 variables to acquire more information.

Figure 15: Dependence plot between the Shapley values of the minimum account balance for the second month against the first month of a bad account categorized as good by the model.

Figure 15 explains how the model makes a prediction when considering more than one feature. The colourmap shows the magnitude of the minimum AB in the first month. Points on the right have a higher minimum account balance in the second month and, as expected, are coloured in pink, meaning that they also have a high minimum for the following month. The same correlation is shown when we move towards the left side. Surprisingly, points with very high values of min AB don't influence the model's decision, and therefore their Shapley values are very close to zero. As we get closer to zero, we observe a peak with positive Shapley values allocated to these accounts, influencing the prediction towards a bad client. However, this peak only appears for the accounts with values close to zero in both months (with a purple colour). On the contrary, when one of the months turns blue, with an extremely negative minimum as we move to the left extreme of the axis, the behaviour changes and the model predicts a good customer. This is a very interesting insight, as we may expect low AB values to be considered bad and high ones good, but the algorithm has learned that extreme AB is normally not associated with bad customers. In the case of extremely negative AB, it doesn't hesitate to consider the account a good customer. Looking at the result from a business side, we may argue that customers with a good bank history are more likely to get credit, be trusted by the bank and show this extreme behaviour. On the other hand, regular clients, with accounts almost at zero, are less trustworthy.

VII Conclusions

In this project, a machine-learning-based method for credit scoring has been proposed. To meet the legal requirements that oblige financial institutions to explain the basis of every rejected loan application, we presented an explainable model that can break down a prediction by showing the impact of each feature.

Moreover, our research has highlighted the importance of choosing an adequate resampling technique to address the imbalance of the dataset. We have managed to increase the GINI by 7.5% by adopting the right method for a given model. We have provided further evidence that, regardless of the number of features used, boosted models outperform Linear Models, Decision Trees and Neural Networks. Despite some inconsistencies between the AUC comparisons, with cross-validation we confirmed the outstanding performance of CatBoost over the other algorithms of the boosting family.

Our experiments agree with previous results arguing that boosted models can be more accurate than Neural Networks while being more interpretable than Linear Models lundberg2020local2global. Our research on Neural Networks suggests that they should not be considered the preferred model without notably increasing the size of the dataset.

The strength of our work lies in the explainability section, for which we used SHAP to interpret the model predictions from a global and a local perspective. The findings are not transferable to all credit scoring models, because they provide an adjusted understanding of the outcomes for the selected bank. Consequently, our work reveals the importance of not building a universal answer and de-mystifies the assumption of a unique solution. One promising application of our technique would be to understand the bank-customer relationship, not only by understanding the behaviour of an account but also the behaviour of clients as a collective that interacts with a financial entity.

We acknowledge financial support by CRIF S.p.A and Università degli Studi di Bologna. We would like to thank Robert Lesi and Anna Boni for their valuable advice and ongoing collaboration with the experimental work.


Appendix A KPIs description

| Code | Definition |
|---|---|
| min_AB_30f | Minimum account balance value achieved in the furthest month of the 90 days window. |
| mean_T_90 | Mean of number of transactions in the 90 days window. |
| AB_CalDay | Account balance on the calculation day or the day the client asked for the credit. |
| days_neg_TB_30f | Number of days with a negative transaction balance in the furthest month of the 90 days window. |
| min_sumT_day_90 | Minimum value of the sum of the transactions in a day in the 90 days window. |
| negT_30f | Number of negative transactions in the furthest month of the 90 days window. |
| max_AB_30m | Maximum account balance value in the second month before the calculation day in the 90 days window. |
| min_sumT_day_30c | Minimum value of the sum of the transactions one month before the calculation day. |
| max_AB_90 | Maximum account balance value in the 90 days window. |
| mean_T_30c | Mean of number of transactions one month before the calculation day. |
| min_AB_90 | Minimum account balance value achieved in the 90 days window. |
| days_T_30c | Number of days with transactions one month before the calculation day. |
| min_AB_30m | Minimum account balance value in the second month before the calculation day in the 90 days window. |
| max_sumT_day_30m | Maximum value of the sum of the transactions in the second month before the calculation day in the 90 days window. |
| mean_sumT_day_30f | Mean value of the sum of the transactions' value on a day in the furthest month in the 90 days window. |
| mean_sumT_day_30c | Mean value of the sum of the transactions' value on a day one month before the calculation day. |
| min_sumT_day_30f | Minimum value of the sum of the transactions in the furthest month in the 90 days window. |
| days_negT_90 | Days with negative transactions in the 90 days window. |
| mean_sumT_day_30m | Mean value of the sum of the transactions' value on a day in the second month before the calculation day in the 90 days window. |
| days_T_30f | Number of days with transactions in the furthest month in the 90 days window. |