1 Introduction
Corporate credit rating is an assessment of the credit risk level of a company. Generally, it is issued by a credit rating agency such as Standard and Poor's, Moody's, or Fitch (Standard and Poor's Corporation, 1981; S&P Global, 2018). The rating expresses the agency's opinion about a company's ability to meet its financial obligations in full and on time. This rating serves as an aid to financial investors in assessing various investment opportunities. Since the credit rating is supposed to be a uniform measure across companies, it enables investors to compare risk levels of the companies which issue the financial instruments present in their portfolios.
Credit rating is therefore a very important measure for companies. It expedites the process of purchasing and issuing bonds by providing a uniform and efficient measure of credit risk (Akdemir and Karslı, 2012). Thus, instead of borrowing from banks, public companies are more likely to raise money from capital markets by issuing bonds and notes. A good credit rating is beneficial to companies. Bond yields are negatively related to credit rating (Luo and Chen, 2019). That is, a higher credit rating can help public companies raise funds at a lower repayment cost. Further, a good credit rating means that the company is less likely to default on its obligations, thus attracting risk-averse investors such as pension funds and mutual funds (Dittrich, 2007).
Recent literature implements machine learning and deep learning techniques to assess public corporations' credit ratings. In the US markets, corporate credit rating has been evaluated using Support Vector Machines (SVMs), tree-based models, and network learning methods such as Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks (Ye et al., 2008; Wallis et al., 2019; Hájek and Olej, 2011, 2014; Golbayani et al., 2020b; Wang et al., 2020). Deep learning techniques are popular for assessing credit risk in the European and Asian markets as well (Khashman, 2010; Kim and Sohn, 2010; West, 2000; Khemakhem and Boujelbene, 2015; Zhao et al., 2015; Addo et al., 2018). Even though machine learning and deep learning methods have achieved considerable accuracy for various types of classification problems (LeCun et al., 2015), and in particular for credit risk assessment (Golbayani et al., 2020a), the constructed neural networks continue to be treated as a black-box method. This black box maps the input features into a classification output without a low-level explanation (Chakraborty et al., 2017; Carvalho et al., 2019). However, it is very important for a model to provide visibility and interpretability: which specific features are important, and how do these features impact the output? In finance, interpretability of the model is critical, as it is required by law. Financial regulation provides investors the right to receive an explanation of the algorithms used by investment firms (Protection Regulation, 2018; General Data Protection Regulation, 2016; Goodman and Flaxman, 2017).
Interpretable machine learning is a fast-growing field that addresses this issue. It is defined as the use of machine learning or deep learning models to extract relevant knowledge about domain relations contained in the data (Murdoch et al., 2019). The interpretability of a machine learning method may be divided into: (1) model explanation, and (2) post-hoc explanation. Model explanation means that the model is inherently interpretable and can generate explanations when trained (Yang et al., 2016). Post-hoc explanation refers to the capability of the model to generate an explanation based on existing decisions (Mordvintsev et al., 2015; Plumb et al., 2018).
In mathematics, once a functional relationship f between an input x and an output y is constructed (i.e., f is a deterministic function), one is able to determine the preimage of a set. That is, one can calculate f^{-1}(A), defined as the set of inputs x whose values f(x) lie in A, where A is a set in the codomain of f. Generally speaking, a machine learning technique creates a relationship y = f(x), where f is provided by the ML technique used, and typically y is categorical. Thus, given another input x_0, a constructed ML technique is able to calculate the output of x_0 by simply calculating f(x_0). However, providing the preimage of a specific category is hard. Often, the domain set is ill defined, and furthermore there is randomness in the ML technique. For instance, if a particular x has a 50/50 probability of being in category c_1 or category c_2, then should the preimage of c_1 contain x, or rather should x be in the preimage of c_2?
In order to answer such questions, the counterfactual explanation was introduced for ML techniques (Wachter et al., 2017). Specifically, given a ML technique f that associates an input x with an output y, if we want the output to be a different value y', how should the particular input x be modified into x' so that f(x') = y'? The counterfactual explanation technique has been applied to image recognition, healthcare and language models (Goyal et al., 2019; Huang et al., 2019; Prosperi et al., 2020). The closest application of counterfactual explanation to finance we could find is to credit card applications with a binary black box classifier
(Grath et al., 2018). Specifically, in the paper cited, the authors use the counterfactual explanation to provide advice about how to change credit card applications in order to obtain a successful outcome. The authors solve a typical optimization problem using a Median Absolute Deviation (MAD) norm. In our work we modify the optimization problem by focusing on the sparsity of the counterfactual solution.
We next discuss the challenges faced when calculating a counterfactual explanation. In a classification problem a high dimensional input x is assigned through f to an output y in a countable (often finite) set. Therefore, the ML "function" f is not injective. This means that for each output there is a range of inputs which are mapped into it. Now look at Fig. 1, which describes our credit rating problem. Obviously, the solution is not unique. That is, there are multiple x''s which will map into the new y'. To make matters even more complex, the function f is in fact "probabilistic". For example, the same input x may be associated to output y_1 with probability p_1 and to output y_2 with probability p_2. Since the ML technique decides y_1 solely because p_1 > p_2, the magnitude of the probabilities needs to be taken into consideration as well.
When studying corporate credit rating, there are two considerations worth mentioning. First, a financial statement contains a large number of features. For instance, the Compustat® dataset (Compustat, 2019) contains financial accounting variables collected from the original quarterly financial statements. It is impractical for a company to focus on changing all features. Second, some of the features may not be possible to change, for example: Comprehensive Income - Noncontrolling Interest, or Equity in Earnings (I/S) - Unconsolidated Subsidiaries.
The purpose of this paper is to set up a proper optimization problem to address these issues. The central idea is to minimize the number of features modified in the counterfactual explanation x' = x + δ. In finance, particularly when applied to corporate credit rating, this allows a company's decision makers to focus their attention when trying to improve the company's credit rating. In Section 2 we describe the optimization problem and propose an algorithm, which we call "the sparsity algorithm", to solve it. Section 3 presents experimental results obtained from both simulated data and real rating data.
2 Methodology
In this section, we describe the optimization problem and the algorithm introduced to solve this problem.
2.1 Statement of problem
As described in the introduction, the goal of this work is to discover the smallest subset of input data that can realistically be changed, so that the output of the model is reclassified for this changed input. To achieve this goal we propose solving a minimization problem.
Specifically, given a trained deep learning model f which relates an input variable x to a specific classification y, the problem is to find x' such that the response of f at x' is a different class than the response at x. However, only certain components of x can realistically be modified. Further, the problem attempts to find the smallest modification of x which will accomplish the respective reclassification. Thus, we minimize the L0 "norm" of the modification δ, and we impose a mask M on the components that may not change. The problem to solve is expressed mathematically as:
(1)    min_δ ||δ||_0
       s.t. f(x + M ⊙ δ) = y'
Here, ||·||_0 is a mapping which counts the nonzero elements of a vector. This operator, often described as the L0 norm, is not actually a norm. However, it is used extensively in the Machine Learning area (Shukla and Fricklas, 2018). ⊙ is the Hadamard product, which calculates the elementwise product of two matrices of the same dimensions. M is a predefined vector with values 0 and 1, which masks the input components of x that are not modifiable. y' is the desired output. With δ* denoting the solution of problem (1), the counterfactual explanation is x' = x + M ⊙ δ* (Wachter et al., 2017). In a credit rating problem, y' is typically taken as a one grade upgrade from the original credit rating y, but in principle it may be any target rating.
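To make the ingredients of problem (1) concrete, the toy sketch below evaluates the masked perturbation x + M ⊙ δ and the L0 count. All numbers and the stand-in classifier f are hypothetical illustrations, not the paper's model or data:

```python
import numpy as np

# Toy classifier standing in for the trained model f (hypothetical).
def f(x):
    return int(x[0] + x[1] > 1.0)  # two classes: 0 and 1

x = np.array([0.3, 0.4, 0.2, 0.1])      # original input
M = np.array([1.0, 1.0, 0.0, 0.0])      # mask: only the first two features are modifiable
delta = np.array([0.5, 0.0, 0.7, 0.0])  # candidate modification

x_prime = x + M * delta                 # the Hadamard product M ⊙ δ zeroes the frozen features
l0 = np.count_nonzero(M * delta)        # the L0 "norm": number of features actually changed

print(x_prime)      # the third feature is untouched despite delta[2] != 0
print(l0)           # 1
print(f(x_prime))   # 1 — the masked perturbation flips the class from f(x) == 0
```

Note that delta[2] = 0.7 has no effect because M zeroes it out; this is exactly how the mask keeps non-modifiable accounting variables fixed.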
Remark 1.
In this paper f is modeled using a MultiLayer Perceptron (MLP). The MLP is the most prevalent network architecture for credit rating problems (Ahn and Kim, 2011; Huang et al., 2004; Kumar and Haynes, 2003; Kumar and Bhattacharya, 2006). In credit rating applications the output layer of f contains the distinct classes of the corporate credit rating. The error in prediction is obtained by applying a categorical crossentropy loss function on the output layer. A GridSearch has been applied to find the optimal values of the MLP hyperparameters for our specific datasets.
2.2 The Algorithm
The sparsity minimization problem is a well studied problem (Yuan and Ghanem, 2016; Cai et al., 2013; Zhang et al., 2020). However, the problem as written in (1) is still very challenging when the function f is complicated, as is the case for a deep neural network. The difficulty comes from L0 not being a norm, as well as from the function f having a very complex form. In previous work (Grath et al., 2018), the authors use the Median Absolute Deviation (MAD) in order to impose sparsity. In our case their approach is not feasible for two reasons. First, MAD imposes sparsity by minimizing the size of the change of certain features (the features that are far from the median). In practice, this results in changing ALL components, with some components having relatively small changes. For our finance applications, sparsity means that most components of the change have to be exactly 0. Second, the MAD weights used in the optimization are determined automatically from the dataset. In our application, some features cannot be modified. Thus, we need to define the problem in a way that will allow the algorithm to modify only prespecified features.
In our approach to solving (1), we replace ||·||_0 with the L1 norm. There are two reasons for this. First, the L1 norm has been previously used as a regularizer to increase sparsity (Bruckstein et al., 2009; Selesnick, 2017). Second, since it is a proper norm, we can rewrite problem (1) as the following unconstrained optimization problem:
(2)    min_δ λ (f(x + M ⊙ δ) - y')^2 + ||δ||_1
Note that problem (2) treats the output of f as a single number. However, as mentioned in the introduction, most Machine Learning methods take the decision based on a set of probabilities associated to each of the discrete outputs. To handle this issue, f(x + M ⊙ δ) is replaced with the corresponding set of probabilities p(x + M ⊙ δ) (the output distribution). The output y' is replaced with the ideal probability set p' (Janocha and Czarnecki, 2017). We thus replace the first part of the loss function in equation (2) with the crossentropy H (Kline and Berardi, 2005) in equation (3). In this way, we can inform the managers how their entire credit rating probability distribution will be modified following the algorithm's recommendation.
(3)    min_δ λ H(p(x + M ⊙ δ), p') + ||δ||_1
Since the problem is now unconstrained we can use the gradient descent method to solve this problem. Gradient descent is a good way to solve such optimization problems when the objective function is convex and differentiable (Cauchy and others, 1847; Curry, 1944).
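As an illustration of this gradient descent, the sketch below minimizes a loss of the form λ · crossentropy + L1 for a linear softmax stand-in model with analytic gradients. The weights W, b, the value of λ, and the learning rate are assumptions for the example, not the paper's trained MLP or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear softmax classifier standing in for the trained MLP f
# (hypothetical weights over 5 features and 4 rating classes).
W, b = rng.normal(size=(4, 5)), rng.normal(size=4)

def probs(u):
    z = W @ u + b
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=5)                 # original input
M = np.array([1., 1., 1., 0., 0.])     # mask: the last two features are frozen
target = 2                             # desired class y'
lam, lr = 10.0, 0.01                   # constraint weight λ and step size

delta = np.zeros(5)
for _ in range(1000):
    p = probs(x + M * delta)
    grad_ce = (W.T @ (p - np.eye(4)[target])) * M   # gradient of the cross-entropy term
    delta -= lr * (lam * grad_ce + np.sign(delta))  # plus the L1 subgradient

new_class = int(np.argmax(probs(x + M * delta)))
p_before, p_after = probs(x)[target], probs(x + M * delta)[target]
print(new_class, p_after > p_before)   # the probability of the target class increases
```

Because the mask multiplies the gradient, the frozen components of δ never move away from 0, which mirrors the role of M in equation (3).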
However, the solution of the unconstrained problem (3) is not necessarily sparse, and it also cannot guarantee that f(x + M ⊙ δ) = y'. This means that the counterfactual solution may not always produce a better credit rating. To solve these issues, we propose a new algorithm as follows.
Compared with the solution obtained directly from equation (3), the sparsity algorithm produces a vector δ with a large number of zero components (a sparse vector). The algorithm accomplishes this task by following three main steps. First, the algorithm calculates the change ratio for each element of the solution vector relative to the corresponding element of the original input vector x. Second, it constructs candidate vectors δ^1, ..., δ^n, going less sparse from vector δ^i to vector δ^(i+1). Third, the algorithm repeatedly solves the problem, putting more and more importance on the boundary condition (f(x + M ⊙ δ) = y'). We end the process if there is at least one candidate solution which qualifies the input for the better rating y'. If there is no solution to the sparsity algorithm, we interpret this as: the rating may not be changed in a simple way for the given input vector x.
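The candidate-construction steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: `predict` is a hypothetical classifier, `delta_dense` stands for a dense gradient-descent solution, and the ratio cap plays the role of the ceiling discussed in Remark 4:

```python
import numpy as np

def sparsify(x, delta_dense, predict, target, ratio_cap=1e6):
    """Sketch of the sparsity step: rank features by relative change and
    test progressively less sparse candidate vectors delta^1, ..., delta^n."""
    # Relative change per feature, with a ceiling when x_i is (near) zero.
    ratio = np.abs(delta_dense) / np.maximum(np.abs(x), 1.0 / ratio_cap)
    order = np.argsort(-ratio)               # most-changed features first
    for k in range(1, len(x) + 1):           # candidates, sparsest first
        candidate = np.zeros_like(delta_dense)
        candidate[order[:k]] = delta_dense[order[:k]]
        if predict(x + candidate) == target:
            return candidate                 # sparsest qualifying candidate
    return None                              # no candidate improves the rating

# Toy check with a hypothetical threshold classifier.
def predict(v):
    return int(v[0] > 1.0)

x = np.array([0.6, 0.5, 0.1])
delta_dense = np.array([0.7, 0.05, 0.01])    # dense GD-style solution
out = sparsify(x, delta_dense, predict, target=1)
print(out)  # [0.7 0.  0. ] — only the decisive feature is kept
```

In this toy run the sparsest candidate (changing only the first feature) already reaches the target class, so all noise components are dropped.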
In the last loop there are two issues when returning the final value δ*.
Remark 2.
If the solution is not unique and there are multiple candidates, for example δ^i and δ^j, we can choose the final output based on whether δ^i or δ^j achieves a smaller value in equation (3), or we can choose the variant with the smallest number of nonzero coordinates.
Remark 3.
If we reach the last step and there is no solution, then the rating of the company cannot be improved based on the existing credit rating model.
Remark 4 (Step 4 in the algorithm).
When a component of x equals 0, the relative change in step 3 for that component will be infinite. This forces the sparsity algorithm to favor choosing the features with values equal to 0. This step in the sparsity algorithm is introduced to set a ceiling for the ratio in order to resolve this issue. However, we will discuss another possible solution in Section 3.2.1 when we apply the algorithm to financial data.
2.3 What is the practical importance of the sparsity algorithm?
The idea of this work is simple. Given a learned algorithm f, which associates a categorical output (rating) y to an input x, can we find a change δ in x so that the new rating associated to the changed (counterfactual) input x' = x + δ is now y'? In this context, we call the distance (the norm of δ) between the original input and the counterfactual input the effort. In the context of credit rating this calculates how much actual effort has to be put into changing the qualified rating. Having a sparser solution δ may translate into a smaller effort, while making sure that the output class of x' has been improved by at least one notch.
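In code, the two notions of effort used throughout the paper are simply the L0 count and the L2 norm of δ. The numbers below are toy values, not actual statement data:

```python
import numpy as np

x  = np.array([0.60, 0.47, 0.08])   # original input (toy numbers)
xc = np.array([0.77, 0.47, 0.08])   # counterfactual x' from a sparse solution

delta = xc - x
effort_l0 = np.count_nonzero(delta)  # how many features must move
effort_l2 = np.linalg.norm(delta)    # total magnitude of the move

print(effort_l0, round(effort_l2, 2))  # 1 0.17
```

A sparse δ keeps effort_l0 small even when the single changed feature moves by a sizable amount.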
3 Empirical Results
We will be using two sets of data to demonstrate the validity of the proposed algorithm. First, we shall use synthetically generated data to illustrate the performance of the algorithm on a simple-to-understand case. Second, we use quarterly fundamental data obtained from the Compustat Database (Compustat, 2019). The fundamental data contains 332 accounting variables including balance sheet data, income statement data, etc. We use Standard and Poor's credit ratings as the target rating y. The second case study is the real financial study we wish to analyze.
In this work we are interested in answering three different questions.
1. Are the results of the sparsity algorithm intuitively correct when using the synthetically generated data?
2. Is it possible to improve a credit rating with less effort than it actually took in reality?
3. Does the effort needed to improve a rating depend on the rating itself? Specifically, do we need to exert more effort when changing a rating from noninvestment grade to investment grade than when changing a rating within the investment grade?
3.1 Case study: synthetically generated data
Since f is produced by a machine learning algorithm, in principle we could attempt to prove mathematically that the sparsity algorithm solves equation (1). The Lagrange multiplier version of the problem in equation (3) is well posed, and gradient descent will provide the optimal solution. The sparsity algorithm imposes constraints on the solution, and it fundamentally checks how close the solution is to that of the original problem (1). Thus the idea of a mathematical proof would be to show that the sparsity algorithm produces an improvement at every step and that in the limit we obtain the solution of (1).
However, such a proof would be dry and would only bring joy to the mathematically inclined. We chose to follow a different approach. In this section we design an intuitive case study by synthetically generating data in such a way that it has an easy to understand solution. We compare the solutions obtained using the classical gradient descent with the solutions obtained using the sparsity algorithm. We perform matched pairs one-sided t tests on the L0 and L2 norms to compare these solutions.
We create a 5-dimensional dataset in which all of the features are normally distributed random variables. We let x1 and x2 denote the important variables, and we let x3, x4, and x5 be noise variables. Specifically, the x1 and x2 variables are each a mixture of normals, with means chosen so that the four classes are centered in the four quadrants of the (x1, x2) plane. The x3, x4 and x5 variables are iid normally distributed noise.
Figure 2 shows the projection of the synthetically generated points onto the first two coordinates. We can clearly see the centers of the classes. In this synthetically generated data, we arbitrarily define blue points as rating 1, orange points as rating 2, green points as rating 3, and red points as rating 4 (counterclockwise, starting from the first quadrant). We make the convention that rating 1 is the best, with quality decreasing so that rating 4 is the worst. In this experiment, we aim to improve the rating of the points using the smallest effort δ.
The point of this synthetically generated case study is to showcase the results of the algorithms in a context where we can plot and actually see the results.
3.1.1 Results obtained when using the synthetically generated data
As mentioned, we want to determine which coordinates need to be changed to "improve the rating". In this simple exercise, for a point x this translates into determining the "best" δ that will improve the class number. To illustrate the performance of the algorithm we pick 3 points (one from each of classes 4, 3, and 2, respectively) which showcase the largest difference between the two algorithms used. Table 1 presents the coordinates of the 3 points chosen, and the arrows in Figure 3 show the counterfactual points in the improved rating classes. The purple arrow is the δ from the gradient descent, while the yellow arrow depicts the sparsity algorithm result. Table 1 gives the numerical values of the δ's and shows that the ratings are improved successfully.
The gradient descent solution "improves" the class by changing all coordinates. The largest changes are in the first two coordinates, as they should be, while the remaining three coordinates are just noise. Compared to the gradient descent solution, the sparsity algorithm solution removes the "noise" from the features. It picks the relevant coordinate to be changed every time.
However, we also showcase an exception (point 3). For this point the rating indeed improves, from 2 to 1. However, the algorithm picks the x3 feature to change in addition to x1. This is due to the fact that, just by chance, the relative change ratio is larger for x3 than for x2. The sparsity algorithm checks the result for changing x1 alone, which does not change the rating, then at the next iteration it settles on the (x1, x3) solution.
This point 3 is one of the few exceptions we observed in our results. It actually illustrates an issue we will observe in the next case study dealing with real data.
x1  x2  x3  x4  x5  Rating
original vector  0.6019  0.4742  0.0827  0.0595  0.0588  4
GD solution (δ)  0.767  0.5539  0.0179  0.0095  0.012  3
Optimized Algo (δ)  0.767  0  0  0  0  3
original vector  0.5488  1.0176  0.1723  0.2329  0.4329  3
GD solution (δ)  0.3963  1.0276  0.0133  0.0149  0.0074  2
Optimized Algo (δ)  0  1.0276  0  0  0  2
original vector  1.3814  0.5363  0.0031  0.2783  0.074  2
GD solution (δ)  1.5275  0.4331  0.0054  0.0134  0.0071  1
Optimized Algo (δ)  1.5275  0  0.0054  0  0  1
This study primarily focuses on the L0 norm of δ as a measurement of the effort defined in Section 2.3. Recall that the objective of our problem in equation (1) is to increase the sparsity of the solution. However, the L2 norm may be viewed as another measurement of effort, as it calculates the total 'distance' between the original point and the target.
To formally compare the solution from the gradient descent with the solution from the sparsity algorithm we perform matched pairs onesided t tests as follows:
L0 testing
H0: the L0 norm of the sparsity algorithm solution is equal to the L0 norm of the gradient descent solution
Ha: the L0 norm of the sparsity algorithm solution is less than the L0 norm of the gradient descent solution
L2 testing
H0: the L2 norm of the sparsity algorithm solution is equal to the L2 norm of the gradient descent solution
Ha: the L2 norm of the sparsity algorithm solution is greater than the L2 norm of the gradient descent solution
We use all the points in the dataset to perform these tests. We treat each change separately: from 2 to 1, from 3 to 2, and from 4 to 3, respectively. The average L0 and L2 norms for each group and the results of the matched pairs t-tests are presented in Table 2.
From these results it is evident that the L0 norm of the sparsity algorithm solution is significantly smaller than that of the gradient descent solution, i.e., it requires changing significantly fewer features (less "effort").
L2 from sparsity  L2 from GD  L2_diff (p-value)  L0 from sparsity  L0 from GD  L0_diff (p-value)
2 to 1  1.15182  1.15409  0.00227 (0.00024)  1.15255  5.00000  3.84745 (0.00693) 
3 to 2  1.07638  1.07827  0.00189 (0.00022)  1.16672  5.00000  3.83328 (0.00728) 
4 to 3  1.19610  1.20001  0.00390 (0.00040)  1.18393  5.00000  3.81607 (0.00758) 
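Matched pairs one-sided t tests of this kind can be run with scipy. The arrays below are synthetic stand-ins for the per-point norms (the paper's actual values are summarized in Table 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired per-point norms for the two solvers.
l0_sparsity = rng.integers(1, 3, size=100).astype(float)
l0_gd = np.full(100, 5.0)                        # GD changes all five features
l2_sparsity = rng.normal(1.16, 0.05, 100)
l2_gd = l2_sparsity - rng.normal(0.002, 0.0005, 100)  # GD slightly smaller in L2

# One-sided paired tests, matching the hypotheses in the text:
# Ha: L0(sparsity) < L0(GD)   and   Ha: L2(sparsity) > L2(GD)
t_l0 = stats.ttest_rel(l0_sparsity, l0_gd, alternative="less")
t_l2 = stats.ttest_rel(l2_sparsity, l2_gd, alternative="greater")
print(t_l0.pvalue < 0.05, t_l2.pvalue < 0.05)  # True True
```

The `alternative` argument of `scipy.stats.ttest_rel` selects the one-sided direction, so no manual halving of the two-sided p-value is needed.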
3.2 Case study: Quarterly financial statement data
A description of the Financial statement data used
In this section we apply the sparsity algorithm to data obtained from financial statements. Given a particular financial statement, there may be many ways in which to improve the financial stability of a company and thus increase its credit rating. In this work, we are trying to provide a data driven answer based purely on the machine learning technique used. To this end, we have to assume that the machine learning technique used to determine the original rating y is very accurate. Recall that the counterfactual problem we are focusing on is defined for a given f.
We apply the methodology to companies chosen from 3 sectors of the US economy: Financial, Healthcare, and Information Technology (IT). We first clean the data by removing features which are not reported for each of the specific sectors. The data is thus reduced to around 300 variables for each sector (294, 296, and 296, respectively). Next, we define the mask M in equation (3) by analyzing all remaining features for each sector and determining whether each feature can be feasibly changed. More precisely, certain accounting variables may not be changed because of contractual obligations, unpredictable events, tax or governance considerations, etc. Table 3 groups all the reasons we found as to why accounting variables may not be changed in practice. The table also lists one accounting variable as an example for each of the reasons. A complete list of accounting variables that we found hard or impossible to change is presented in Table 11 of the Appendix 5. The remaining numbers of variables that are not masked by M are 87, 87, and 86, respectively, for the Healthcare, IT and Finance sectors.
Reasons  Example 

Scheduled items  Pension Plan 
Assets are discontinued operations  Extraordinary Items and Discontinued Operations 
Intangible asset  Good Will 
Special items  Costs of Failed Acquisitions 
Regulated items  Tier 1 Capital Ratio 
Agreements with shareholders, employees  Deferred Compensation 
Computational Items  Depreciation & Amortization 
Special events  Loss from Flood/Fire 
Loss/gain from subsidiary  Equity in Earnings (I/S)  Unconsolidated Subsidiaries 
Nonoperating items  Gain/Loss on Sale of Property 
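Constructing the mask M from such a list of non-modifiable variables is mechanical. The feature list below is a tiny illustrative subset (the frozen names are examples drawn from Table 3), not the full Compustat variable set:

```python
import numpy as np

features = ["Revenue", "R&D Expense", "Good Will",
            "Tier 1 Capital Ratio", "Deferred Compensation"]

# Variables deemed non-modifiable for the sector (examples from Table 3).
frozen = {"Good Will", "Tier 1 Capital Ratio", "Deferred Compensation"}

# M holds a 1 for every feature the optimizer may change and a 0 otherwise.
M = np.array([0.0 if f in frozen else 1.0 for f in features])
print(M)            # [1. 1. 0. 0. 0.]
print(int(M.sum())) # 2 modifiable features
```

In the actual application the same construction runs over the roughly 300 sector-specific variables, leaving 86 to 87 of them unmasked.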
3.2.1 Question 2: Comparing the results of the algorithms with quarters when companies changed ratings.
It is simple to visualize the results of the algorithms for the synthetically generated data. For real data, the clusters are hard to visualize, but the algorithms work in a similar way. In most cases we are able to determine a δ which improves the rating using either the gradient descent (GD) or the sparsity algorithm. However, how relevant is this δ? Suppose a company listens to our advice and for the next quarter places resources towards changing the variables indicated by δ. If the targets are reached, would the company improve its rating during the next quarter?
This is, of course, a hard question to answer. In an attempt to answer it, we focus on those quarters and companies whose ratings actually improved during the next quarter. We apply the GD and the sparsity algorithms to the statements from the quarters before the rating change. To assess the effectiveness of the proposed changes we calculate the actual real change between the two consecutive quarters when the ratings improved. Table 4 presents these values in columns 1 and 2. For example, the L0 number for the Healthcare sector real change is calculated by counting how many features changed between the two consecutive quarters when the rating of the company went up. We display the average number of features changed for all companies in the healthcare sector which went up in ratings. We compare this real change "effort" with the changes proposed by the two algorithms in columns 3 and 4 of the table. The numbers in these columns are calculated using only the data from the quarters before the rating changed. Mathematically, they are calculated as the respective norms of δ. Since both the Gradient Descent (GD) and the sparsity algorithms only change a selected number of features (the unmasked features), for a proper comparison in Table 4 we calculate the real change only for the features that can be changed (column two). Similar numbers are calculated for all sectors.
Real change  Real change for relevant variables  Change for GD  Change for Sparsity  Match Rate  

Healthcare  L0  113.84  59.02  87.00  53.82  85.43% 
L2  4744.44  4263.46  6021.42  4615.10  
IT  L0  119.13  61.95  87.00  60.12  87.41% 
L2  12550.07  11774.29  2727.35  2057.80  
Financial  L0  101.10  48.39  86.00  57.24  76.58% 
L2  65607.00  46018.43  11474.30  7591.71 
The last column in Table 4 is labeled Match Rate. For each company that changed rating, we look at the features suggested to be changed by the sparsity algorithm. We calculate what percentage of them were actually changed in the real statements between the two quarters when ratings improved. A high match rate indicates that the features selected by the sparsity algorithm are similar to the changed features in the real statements. It is worth mentioning (again) that the sparsity algorithm comes up with these features based solely on the data from the quarter BEFORE the ratings changed.
However, to obtain the numbers in Table 4 we implement the sparsity algorithm with a different Step 4 than the one mentioned in Remark 4. Specifically, step 4 becomes: "we set the ratio to 0 when the component x_i = 0". With this change the sparsity algorithm essentially ignores those feasible features whose value in the original statement is 0. We do this inspired by the synthetic data in the previous section. Recall point 3, which by chance had a large relative change in the third coordinate. A similar phenomenon happens in the real statement data when there are 0's present in the unmasked set of features. The algorithm focuses on them, as the relative change from 0 is technically infinite. In fact, in the real statements some of those features do change, and that is probably why we are not able to capture 100% of the changes.
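The two step-4 variants can be sketched side by side; the cap value and the numbers are illustrative only:

```python
import numpy as np

def change_ratio(x, delta, cap=1e3, drop_zeros=False):
    """Relative change per feature under the two step-4 variants (a sketch):
    either cap the ratio for zero-valued features, or ignore them entirely."""
    safe_x = np.where(x != 0, x, 1.0)          # placeholder to avoid 0-division
    ratio = np.where(x != 0,
                     np.abs(delta) / np.abs(safe_x),
                     0.0 if drop_zeros else cap)
    return ratio

x = np.array([2.0, 0.0, 4.0])
delta = np.array([1.0, 0.5, 1.0])
print(change_ratio(x, delta).tolist())                   # [0.5, 1000.0, 0.25]
print(change_ratio(x, delta, drop_zeros=True).tolist())  # [0.5, 0.0, 0.25]
```

With the cap, a zero-valued feature ranks first and is favored for change; with `drop_zeros=True` it is never selected, which is the variant used to produce Table 4.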
Looking at L0 (the number of changed features), the results are consistent. In the two-quarter data the average number of features changed is between roughly 101 and 119, while the number of relevant features changed is about half that total. The gradient descent changes all unmasked features, and in fact only the L2 distance is relevant for it. However, the sparsity algorithm produces a number of features to be changed which is similar to the real number. Although it is nice to see that we recover most of the features that actually changed, this is not our goal, as we want to identify the smallest number of changes possible with a minimum effort. By neglecting the features with a 0 value, the algorithm is probably ignoring features that might be very important for improving the credit rating. This is why we replace step 4 in the sparsity algorithm with "we set a ceiling for the ratio when the component x_i = 0", as it was in fact written in the actual algorithm. We rerun this algorithm and present the results in Table 5.
Real change  Real change for relevant variables  Change for GD  Change for Sparsity  Match Rate  

Healthcare  L0  113.84  59.02  87.00  22.93  61.01% 
L2  4744.44  4263.46  6021.42  5239.12  
IT  L0  119.13  61.95  87.00  24.73  52.59% 
L2  12550.07  11774.29  2727.35  1925.80  
Financial  L0  101.10  48.39  86.00  33.46  37.35% 
L2  65607.00  46018.43  11474.30  7962.78 
Including the 0-valued features in the set of possible changes helps the sparsity algorithm reduce its L0 norm. However, the match rate of the sparsity algorithm drops to between roughly 37% and 61%. Considering that the purpose of the algorithm is to identify relevant features for improving ratings, the sparsity of the solution δ is important. In terms of the L2 norm, we note that the magnitude of change ("effort") is reduced dramatically in the Finance and IT sectors, but it is in fact increased on average in the Healthcare sector. We believe this is normal: by focusing on improving the sparsity of the solution, the feasible domain is reduced. Thus, to qualify the solution for an improved rating, we have to exert more effort on those feasible features. This may cause an increase in the L2 norm of the solution.
Comparatively, we observe that the companies in the financial sector need to exert more effort to improve their credit rating than companies in the IT and healthcare sectors.
Why two sparsity algorithms?
Generally, published articles do not detail all attempts and only showcase the best variant, which is typically the last version of the algorithm. In this article, we chose to present both a variant of the algorithm which we initially employed and the final algorithm version. We decided to do this because we are dealing with real data between two quarters, and our algorithm depends on how well the MLP is performing. Thus it is important to validate the features obtained by applying the algorithm to the previous quarter against the features actually changed in the next quarter. This match is an argument that our algorithm captures the relevant changes, as well as that it produces accurate results.
This is also important for the final algorithm results in Table 5. Taken by themselves, the sparsity algorithm results are academic. However, when corroborated with the results in Table 4, which show that it is possible to match the real changes, we think the results point to a valid way to potentially produce an improved rating during the next quarter.
3.2.2 Comparing the effort needed to improve ratings from different levels
In this section, we apply the sparsity algorithm to all observations in the dataset. We calculate the "effort" needed for a company during a particular quarter to improve its rating during the next quarter. We aggregate the results by rating and sector. We want to investigate how the effort changes depending on the rating the company holds at the time of the respective quarter. For example, is it harder to improve the rating from the highest noninvestment grade (BB+) to an investment grade (BBB−) than it is from other ratings?
Table 6: Standard & Poor’s rating scale and rating descriptions

| S&P rating | Rating description | Grade |
|------------|--------------------|-------|
| AAA | Extremely strong capacity to meet its financial commitments | Investment grade |
| AA+ | Very strong capacity to meet its financial commitments | |
| AA | | |
| AA- | | |
| A+ | Strong capacity to meet its financial commitments | |
| A | | |
| A- | | |
| BBB+ | Adequate capacity to meet its financial commitments | |
| BBB | | |
| BBB- | | |
| BB+ | Has inadequate capacity to meet its financial commitments | Non-investment grade |
| BB | | |
| BB- | | |
| B+ | Has the capacity to meet its financial commitments | |
| B | | |
| B- | | |
| CCC+ | Substantial risks | |
| CCC | Extremely speculative | |
| CCC- | Default imminent with little prospect for recovery | |
| CC | | |
| C | | |
| D | In default | |
For reference, in Table 6 we present the Standard & Poor’s classification and rating interpretations. The higher the rating, the lower the interest rate the company has to pay. Furthermore, having an investment grade rating enlarges the pool of investors considerably, as government regulations prevent pension funds and mutual funds from purchasing non-investment grade bonds. In fact, if any of their holdings drops below BBB-, the pension funds are required to sell, often at a loss.
Table 7: Average L2 (“effort”), average L0 (number of features changed), and observation counts for each rating transition, by sector

| ori | curr | L2 (IT) | L0 (IT) | Count (IT) | L2 (Healthcare) | L0 (Healthcare) | Count (Healthcare) | L2 (Financial) | L0 (Financial) | Count (Financial) |
|-----|------|---------|---------|------------|-----------------|-----------------|------------------|----------------|----------------|------------------|
| AA+ | AAA | 108400.3 | 27.3 | 15 | 11183.0 | 30.1 | 13 | | | |
| AA | AA+ | 94786.6 | 52.0 | 47 | 174127.2 | 62.7 | 128 | 41469.8 | 34.7 | 85 |
| AA- | AA | 59565.2 | 38.5 | 59 | 4879.0 | 37.2 | 126 | 37547.4 | 43.2 | 382 |
| A+ | AA- | 23463.1 | 41.8 | 297 | 4854.6 | 28.1 | 268 | 6051.8 | 35.8 | 641 |
| A | A+ | 3482.2 | 31.1 | 194 | 2668.8 | 21.2 | 351 | 15756.5 | 31.1 | 630 |
| A- | A | 1372.4 | 28.1 | 217 | 2892.9 | 29.6 | 192 | 6755.0 | 29.1 | 644 |
| BBB+ | A- | 2025.7 | 32.4 | 234 | 1096.2 | 25.8 | 308 | 4483.3 | 31.0 | 421 |
| BBB | BBB+ | 1250.3 | 28.1 | 342 | 1598.6 | 24.5 | 343 | 2492.2 | 31.8 | 353 |
| BBB- | BBB | 386.7 | 22.1 | 195 | 1169.3 | 21.9 | 244 | 2886.5 | 27.8 | 212 |
| BB+ | BBB- | 1474.1 | 32.9 | 180 | 602.3 | 23.8 | 125 | 1637.3 | 36.5 | 80 |
| BB | BB+ | 864.0 | 25.9 | 74 | 567.5 | 23.6 | 182 | 6191.6 | 47.0 | 19 |
| BB- | BB | 1121.8 | 29.0 | 146 | 262.6 | 16.4 | 90 | 4243.2 | 47.7 | 11 |
| B+ | BB- | 411.0 | 28.9 | 112 | 1344.1 | 20.5 | 45 | 1943.0 | 43.1 | 20 |
| B | B+ | 402.6 | 16.6 | 39 | 517.9 | 46.7 | 15 | 4176.9 | 63.5 | 2 |
| B- | B | 950.7 | 33.9 | 24 | 1817.3 | 34.5 | 15 | | | |
| CCC+ | B- | 1033.3 | 36.0 | 12 | 1033.4 | 41.1 | 9 | | | |
| Average | | 8735.4 | 31.3 | | 11228.0 | 27.1 | | 11288.6 | 33.2 | |
Table 7 presents the L2 and L0 averages for the sparsity algorithm as well as how many observations fall in each category (the “Count” columns). We observe no clear pattern indicating that the results align with the rating descriptions in Table 6. We generally note that the “effort” needed increases as the ratings increase. This suggests that it may be easier to improve a rating from, say, BB+ to BBB- than it is from A+ to AA-.
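The per-transition averages of the kind reported in Table 7 can be reproduced from per-company results with a standard group-by aggregation. A minimal sketch with pandas, using invented column names and values:

```python
import pandas as pd

# Hypothetical per-company results: one row per successful counterfactual.
results = pd.DataFrame({
    "ori":  ["A+", "A+", "BBB-"],   # rating before the improvement
    "curr": ["AA-", "AA-", "BBB"],  # rating after the improvement
    "L2":   [20000.0, 27000.0, 400.0],
    "L0":   [40, 22, 22],
})

# Average effort (L2), average sparsity (L0), and observation count
# for each rating transition.
agg = results.groupby(["ori", "curr"]).agg(
    L2=("L2", "mean"), L0=("L0", "mean"), Count=("L2", "size"))
```

Splitting the input frame by sector before aggregating yields the three column groups of Table 7.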
However, we need to point out an issue that arises when we aggregate all these companies. A particular credit rating indicates that a company lies within a range of risk levels; it does not provide a specific risk value for that company. Thus, the actual risk values of companies with the same credit rating may differ. For example, company XXX may be at one extreme of the range for the AA rating, while company YYY may be at the other extreme and still be rated AA. It is obviously more difficult for one of these companies to improve its rating than for the other. The results in Table 7 include all companies within a particular rating range and report an aggregate average effort. It may be very hard, or even impossible, for a company that just improved its rating to reach an even better rating the following quarter. In the sparsity algorithm, λ in equation (3) controls this “difficulty”. A higher λ indicates that a particular company finds it harder to improve its rating level, and thus the “effort” needed may be larger.
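To illustrate the role of λ (this is only a hedged sketch; the exact form is the paper's equation (3), which we do not reproduce here), one can think of an objective that trades off the magnitude of the change against a λ-weighted term pushing the prediction toward the target rating:

```python
import numpy as np

def objective(delta, target_loss, lam):
    """Illustrative trade-off only: effort (L2 norm of the change in the
    financial variables) plus a lambda-weighted classification loss
    toward the target rating. A company far from the target class needs
    a larger `lam` before the loss term dominates and the optimizer
    accepts a larger change, so a higher lambda signals a harder climb."""
    return float(np.linalg.norm(delta) + lam * target_loss)
```

Under this reading, the λ values tabulated below act as a per-company proxy for where the company sits inside its rating's risk range.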
Table 8: Number of companies that improved their rating, per λ value and starting rating

| Lambda | AAA | AA+ | AA | AA- | A+ | A | A- | BBB+ | BBB | BBB- | BB+ | BB | BB- | B+ | B | B- |
|--------|-----|-----|----|-----|----|----|----|------|-----|------|-----|----|-----|----|----|----|
| 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 2 | 6 | 2 | 0 | 0 | 0 | 7 | 4 | 0 | 18 | 0 | 0 |
| 50 | 0 | 0 | 1 | 2 | 49 | 159 | 135 | 122 | 105 | 69 | 42 | 72 | 85 | 21 | 22 | 9 |
| 100 | 0 | 0 | 0 | 22 | 55 | 47 | 46 | 114 | 81 | 63 | 10 | 51 | 19 | 0 | 1 | 1 |
| 200 | 0 | 0 | 11 | 65 | 46 | 5 | 29 | 80 | 7 | 29 | 6 | 18 | 8 | 0 | 1 | 2 |
| 500 | 0 | 0 | 5 | 95 | 17 | 0 | 17 | 24 | 1 | 11 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1000 | 0 | 5 | 6 | 69 | 22 | 0 | 5 | 2 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10000 | 15 | 41 | 36 | 43 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100000 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

A higher λ means more effort needs to be exerted to improve the rating.
Table 9: Number of companies that improved their rating, per λ value and starting rating

| Lambda | AAA | AA+ | AA | AA- | A+ | A | A- | BBB+ | BBB | BBB- | BB+ | BB | BB- | B+ | B | B- |
|--------|-----|-----|----|-----|----|----|----|------|-----|------|-----|----|-----|----|----|----|
| 10 | 0 | 0 | 0 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 0 | 1 | 2 | 54 | 82 | 85 | 94 | 63 | 95 | 6 | 5 | 0 | 2 | 0 | 3 | 2 |
| 100 | 0 | 7 | 10 | 106 | 123 | 82 | 48 | 82 | 79 | 5 | 1 | 0 | 4 | 0 | 0 | 3 |
| 200 | 1 | 1 | 52 | 118 | 140 | 127 | 82 | 89 | 13 | 17 | 1 | 0 | 9 | 1 | 8 | 4 |
| 500 | 9 | 6 | 108 | 183 | 135 | 125 | 99 | 76 | 25 | 43 | 9 | 8 | 4 | 1 | 4 | 0 |
| 1000 | 3 | 3 | 61 | 84 | 37 | 104 | 42 | 27 | 0 | 9 | 3 | 3 | 1 | 0 | 0 | 0 |
| 10000 | 0 | 40 | 122 | 87 | 72 | 90 | 42 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100000 | 0 | 27 | 27 | 9 | 40 | 31 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

A higher λ means more effort needs to be exerted to improve the rating.
Table 10: Number of companies that improved their rating, per λ value and starting rating

| Lambda | AA+ | AA | AA- | A+ | A | A- | BBB+ | BBB | BBB- | BB+ | BB | BB- | B+ |
|--------|-----|----|-----|----|----|----|------|-----|------|-----|----|-----|----|
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 0 | 7 | 36 | 18 | 11 | 63 | 45 | 79 | 23 | 34 | 23 | 20 | 11 |
| 100 | 0 | 6 | 50 | 78 | 32 | 107 | 53 | 74 | 54 | 57 | 46 | 16 | 3 |
| 200 | 0 | 26 | 73 | 63 | 62 | 66 | 83 | 43 | 36 | 68 | 20 | 9 | 1 |
| 500 | 0 | 49 | 62 | 100 | 45 | 55 | 110 | 47 | 12 | 22 | 0 | 0 | 0 |
| 1000 | 1 | 35 | 29 | 44 | 34 | 14 | 34 | 1 | 0 | 0 | 0 | 0 | 0 |
| 10000 | 63 | 2 | 18 | 47 | 8 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100000 | 64 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |

A higher λ means more effort needs to be exerted to improve the rating.
Tables 8, 9, and 10 present, in each row, the number of companies that successfully improved their current rating for a particular λ value. The tables are split by sector. We can see that as the λ values increase, more companies improve their rating. Recall our assertion about ranges of rating scores. We interpret the values in the tables as follows: companies that are closer to the threshold (smaller λ) improve more easily and thus appear in an upper row of the tables.
Looking at all three tables, we see the numbers shifting to the left as λ increases. This is consistent with our previous observations in Table 7. Indeed, the results seem to indicate that a lower λ (and thus a lower effort) suffices for the majority of the lower-rated companies. In contrast, a larger λ value is needed for the majority of the highly rated companies to improve their score. Thus, as a company’s rating gets better, a much larger effort is needed to further improve it.
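Tables of this shape can be produced with a cross-tabulation of the smallest successful λ against the starting rating. A minimal sketch with pandas (the data below is invented for illustration only):

```python
import pandas as pd

# Hypothetical records: for each company-quarter, the smallest lambda
# at which the sparsity algorithm produced an improved rating.
records = pd.DataFrame({
    "rating": ["A+", "A+", "BBB", "BB+", "BB+"],
    "lam":    [50, 100, 50, 5, 10],
})

# Rows: lambda values; columns: starting ratings; cells: company counts.
table = pd.crosstab(records["lam"], records["rating"])
```

The leftward shift of mass as λ grows then corresponds to higher-rated columns only filling in at the larger row indices.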
4 Conclusion
In this work we propose a sparsity algorithm that finds a counterfactual explanation for the credit rating problem. The sparsity algorithm is designed to discover the smallest set of changes to a company’s financial statement variables that has a high probability of improving the predicted rating to a predefined credit rating.
We apply the sparsity algorithm to a synthetically generated dataset as well as to quarterly financial statements data. Our toy case study, using synthetically generated data, shows that the sparsity algorithm can successfully move points to the target class with less “effort” than the solution obtained using a gradient descent method. The results obtained using quarterly financial statements confirm that the sparsity algorithm may be employed to significantly reduce the “effort” needed to improve a corporate credit rating. More importantly, when analyzing quarterly statements before an actual rating increase, we show that the sparsity algorithm captures the majority of the features that in fact change in the next quarter’s statement. This result gives us confidence to propose the final algorithm, which produces an even more focused recommendation for the corporation’s managers.
Finally, we find that the “effort” required to improve the credit rating is positively related to the credit rating level. Specifically, improving the credit rating of A-rated corporations is much harder than improving the credit rating of B-level companies.
Acknowledgment
The authors would like to acknowledge the UBS research grant awarded to the Hanlon Laboratories, which provided partial support for this research. We want to acknowledge Bingyang Wen, who provided helpful discussions about the algorithm. We also acknowledge Professor Zachary Feinstein, who suggested the use of the L0 norm in the proposed optimization problem.
References
 Credit risk analysis using machine and deep learning models. Risks 6 (2), pp. 38. Cited by: §1.
 Corporate credit rating using multiclass classification models with order information. World Academy of Science, Engineering and Technology, International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering 5 (12), pp. 1783–1788. Cited by: Remark 1.
 An assessment of strategic importance of credit rating agencies for companies and organizations. Procedia - Social and Behavioral Sciences 58, pp. 1628–1639. Cited by: §1.
 From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM review 51 (1), pp. 34–81. Cited by: §2.2.
 Exact top-k feature selection via l2,0-norm constraint. In Twenty-Third International Joint Conference on Artificial Intelligence. Cited by: §2.2.
 Machine learning interpretability: a survey on methods and metrics. Electronics 8 (8), pp. 832. Cited by: §1.
 Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris 25 (1847), pp. 536–538. Cited by: §2.2.
 Interpretability of deep learning models: a survey of results. In 2017 IEEE smartworld, ubiquitous intelligence & computing, advanced & trusted computed, scalable computing & communications, cloud & big data computing, Internet of people and smart city innovation (smartworld/SCALCOM/UIC/ATC/CBDcom/IOP/SCI), pp. 1–6. Cited by: §1.
 Compustat online manual. Standard & Poor’s. Cited by: §1, §3.
 The method of steepest descent for nonlinear minimization problems. Quarterly of Applied Mathematics 2 (3), pp. 258–261. Cited by: §2.2.
 The credit rating industry: competition and regulation. Ph.D. Thesis, Universität zu Köln. Cited by: §1.
 Regulation EU 2016/679 of the European Parliament and of the Council of 27 April 2016. Official Journal of the European Union. Available at: http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf (accessed 20 September 2017). Cited by: §1.

 A comparative study of forecasting corporate credit ratings using neural networks, support vector machines, and decision trees. The North American Journal of Economics and Finance 54, pp. 101251. Cited by: §1.
 Application of deep neural networks to assess corporate credit rating. arXiv preprint arXiv:2003.02334. Cited by: §1.
 European Union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine 38 (3), pp. 50–57. Cited by: §1.
 Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384. Cited by: §1.
 Interpretable credit application predictions with counterfactual explanations. arXiv preprint arXiv:1811.05245. Cited by: §1, §2.2.

 Credit rating modelling by kernel-based approaches with supervised and semi-supervised learning. Neural Computing and Applications 20 (6), pp. 761–773. Cited by: §1.
 Predicting firms’ credit ratings using ensembles of artificial immune systems and machine learning: an over-sampling approach. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 29–38. Cited by: §1.
 Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064. Cited by: §1.
 Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision support systems 37 (4), pp. 543–558. Cited by: Remark 1.
 On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659. Cited by: §2.2.
 Neural networks for credit risk evaluation: investigation of different neural models and learning schemes. Expert Systems with Applications 37 (9), pp. 6233–6239. Cited by: §1.
 Credit risk prediction: a comparative study between discriminant analysis and the neural network approach. Accounting and Management Information Systems 14 (1), pp. 60. Cited by: §1.
 Support vector machines for default prediction of smes based on technology credit. European Journal of Operational Research 201 (3), pp. 838–846. Cited by: §1.
 Revisiting squarederror and crossentropy functions for training neural network classifiers. Neural Computing & Applications 14 (4), pp. 310–318. Cited by: §2.2.
 Artificial neural network vs linear discriminant analysis in credit ratings forecast: a comparative study of prediction performances. Review of Accounting and Finance 5 (3), pp. 216–227. Cited by: Remark 1.
 FORECASTING credit ratings using an ann and statistical techniques.. International journal of business studies 11 (1). Cited by: Remark 1.
 Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
 Bond yield and credit rating: evidence of chinese local government financing vehicles. Review of Quantitative Finance and Accounting 52 (3), pp. 737–758. Cited by: §1.
 Inceptionism: going deeper into neural networks. Cited by: §1.
 Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592. Cited by: §1.
 Model agnostic supervised local explanations. arXiv preprint arXiv:1807.02910. Cited by: §1.
 Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nature Machine Intelligence 2 (7), pp. 369–375. Cited by: §1.
 General data protection regulation. Intouch. Cited by: §1.
 Guide to credit rating essentials: what are credit ratings and how do they work. Standard & Poor’s Financial Services New York. Cited by: §1.
 Sparse regularization via convex analysis. IEEE Transactions on Signal Processing 65 (17), pp. 4481–4494. Cited by: §2.2.

 Machine learning with TensorFlow. Manning, Greenwich. Cited by: §2.1.
 Standard & Poor’s guide to credit rating essentials. Standard & Poor’s. Cited by: §1.
 Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv. JL & Tech. 31, pp. 841. Cited by: §1, §2.1.
 Credit rating forecasting using machine learning techniques. In Managerial Perspectives on Intelligent Big Data Analytics, pp. 180–198. Cited by: §1.
 Is image encoding beneficial for deep learning in finance?. IEEE Internet of Things Journal. Cited by: §1.
 Neural network credit scoring models. Computers & Operations Research 27 (11-12), pp. 1131–1152. Cited by: §1.
 Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489. Cited by: §1.
 A multiclass machine learning approach to credit rating prediction. In 2008 International Symposiums on Information Processing, pp. 57–61. Cited by: §1.
 Sparsity constrained minimization via mathematical programming with equilibrium constraints. arXiv preprint arXiv:1608.04430. Cited by: §2.2.
 Top-k feature selection framework using robust 0-1 integer programming. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.2.
 Investigation and improvement of multilayer perceptron neural networks for credit scoring. Expert Systems with Applications 42 (7), pp. 3508–3516. Cited by: §1.