1 Introduction
Machine learning models are now used in a wide variety of fields. From medicine to data mining, speech recognition to human-computer interaction, and financial investment to risk management, these complex and hard-to-interpret models outperform traditional models by using intricate algorithms for prediction. Explaining these complex models is crucial not only for understanding model predictions but also for identifying which features are the prime contributors to a prediction. Recently, the Shapley value has become a popular way to explain the predictions of machine learning models because of a series of desirable theoretical properties that distinguish it from other explanation methods such as LIME [lundberg2017unified][ribeiro2016should]. The game-theoretic Shapley value describes a way to distribute the total gains of a cooperative game among players such that a few desirable notions of fairness are satisfied [shapley195317][young1985monotonic]. The Shapley value of a player equals the average marginal contribution of that player over all possible ways in which a coalition can be formed. In machine learning models, the Shapley value of a feature equals the average marginal contribution of that feature over all possible permutations of features.
The Shapley value provides a lens for understanding convoluted machine learning models by giving the marginal contribution of each feature. This value depends not only on the feature values and the prediction function but also on the distribution of the data. The Shapley value gives us an explanation of a machine learning model, but what are the factors that influence the Shapley value itself? How does the distribution of a feature influence its Shapley value? How does the Shapley explanation vary for different predicted outcomes from the same model (e.g., a logit model has three outputs: log-odds, probability, and a binary decision such as accept vs. reject)? To the best of our knowledge, these questions are still unexplored in the literature.
In general, these questions are difficult to answer because the Shapley value does not have a closed-form solution and numerical estimates of the Shapley explanation are computationally expensive. Thus, in this paper we tackle these questions for the linear probability model (logit/probit). A linear probability model such as logit regression has three outcome types: log-odds, probability, and binary decision. These outcomes can be used for different business purposes. For example, a bank can use the probability outcome to estimate the expected default rate, or use the binary decision outcome of the same model to accept or reject a credit application. It is important to know whether the Shapley explanations for the different outcomes of a model are aligned. If not, how do they differ?
The rest of this paper is organized as follows. Section 2 describes the logit and probit models and the different outcomes that can be generated using the same model. In Section 3, we provide a closed-form solution for the Shapley value and discuss the factors influencing these values. Section 4 gives a snapshot of the disagreements that arise between the Shapley values for different outcomes. Section 5 shows how the disagreements are influenced by the variance of the distribution. Section 6 covers the global importance of a feature for different outcomes. Finally, Section 7 holds some concluding remarks.
2 Logit and Probit Model
Logit and probit models are commonly used in industry and academia to predict a binary dependent variable $y$ using explanatory variables $x$. Given the intercept $\beta_0$ and the coefficients $\beta$ of the explanatory variables, the probability of $y = 1$ is given by the following expression
$$P(y = 1 \mid x) = F(\beta_0 + \beta^\top x),$$
where $F$ is the standard logistic function in the case of logit regression and the standard normal distribution function in the case of probit regression.
Let $z$ equal $\beta_0 + \beta^\top x$. We can interpret $z$ as the log odds (footnote 1: the log odds equal $\log\frac{p}{1-p}$, where $p = P(y = 1 \mid x)$) for the logit model and as the distance from the mean in standard deviation units for the probit model. In general, we use probability models for classification or binary decision making by setting a threshold either on $z$ or on $p$. Without loss of generality, assume $0$ is the threshold on $z$ for binary decision making. Using $F$, we can get the equivalent threshold in probability, $F(0) = 0.5$.
We can interpret the logit/probit model outcome in three different ways:

Log-Odds or Standard Deviation Unit ($z$): The outcome $z$ can be interpreted as the log odds for the logit model and as the distance from the mean in standard deviation units for the probit model. Since $z$ is linear in $x$, it is easy to interpret, and its Shapley value has a closed-form solution based on $\beta$ and the mean of a feature, irrespective of the distribution of the explanatory variables for a given sample.

Probability ($p$): The prediction can also be interpreted in terms of probability, which is used extensively in the literature and in industry. The probability outcome captures the effect of a change in the features more effectively than the log-odds outcome, because equal changes in $z$ do not translate uniformly into changes in probability.

Binary Decision Making ($z \geq 0$ or $p \geq 0.5$): This interpretation of the logit/probit outcome is directly related to a binary decision-making scenario, for example the acceptance or rejection of a credit card application.
3 Shapley Value
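The Shapley value method was initially introduced in game theory to describe a way to distribute the total gains of a cooperative game among players such that a few desirable notions of fairness are satisfied. The Shapley value of a player equals the average marginal contribution of that player over all possible ways in which a coalition can be formed. Štrumbelj & Kononenko [strumbelj2010efficient][strumbelj2014explaining] and Lundberg & Lee [lundberg2017unified] described a way to use the Shapley value for machine learning model explanation.
Let $f(x)$ be a machine learning model, where $x$ represents the data with $M$ features. For a sample $x$, the Shapley values of the features, denoted by $\phi_i$, are additive, i.e.,
$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i,$$
where $\phi_0$ equals the expected prediction value, i.e., $\phi_0 = E[f(x)]$. Thus, the Shapley values explain the difference between the prediction for the sample and the global average. The Shapley explanation for feature $i$ is estimated in the following way:
$$\phi_i = \sum_{S \subseteq \{1,\dots,M\} \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ v(S \cup \{i\}) - v(S) \right],$$
where $v(S)$ equals the expected predicted value using the features in $S$, i.e., $v(S) = E[f(x) \mid x_S]$.
In the case of two features $x_1$ and $x_2$, writing $v_0 = v(\emptyset)$, $v_1 = v(\{1\})$, $v_2 = v(\{2\})$ and $v_{12} = v(\{1,2\})$, the Shapley value expressions are:
$$\phi_1 = \tfrac{1}{2}\left[v_1 - v_0\right] + \tfrac{1}{2}\left[v_{12} - v_2\right], \qquad \phi_2 = \tfrac{1}{2}\left[v_2 - v_0\right] + \tfrac{1}{2}\left[v_{12} - v_1\right].$$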
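The averaging over coalitions can be made concrete with a small sketch. The function name `shapley_values`, the callable `value` and the toy linear model below are illustrative assumptions, not part of the paper; the code simply averages marginal contributions over all feature orderings, which is equivalent to the coalition-weighted sum.

```python
from itertools import permutations


def shapley_values(value, n_features):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` is a hypothetical callable mapping a frozenset S of feature
    indices to v(S) = E[f(x) | x_S].
    """
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        seen = frozenset()
        for i in order:
            # Marginal contribution of feature i given the features already seen
            phi[i] += value(seen | {i}) - value(seen)
            seen = seen | {i}
    return [total / len(orderings) for total in phi]


# Toy example (assumed): z = b0 + x1 + x2 with independent zero-mean features,
# so v(S) is b0 plus the known feature values.
b0, x = 0.5, (1.2, -0.4)


def v(S):
    return b0 + sum(x[i] for i in S)


phi = shapley_values(v, 2)
print(phi)  # Shapley values of feature 1 and feature 2
```

For two features there are only two orderings, so the loop reproduces the two-term average of marginal contributions; for many features the factorial number of orderings makes this brute-force approach impractical.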
3.1 Shapley Value for Linear Probability Model
For simplicity, we assume two independent (footnote 2: under the independence assumption, the interventional and conditional expectations are the same; this isolates the differences from those arising from different notions of expectation) and normally distributed features $x_1$ and $x_2$ with means $\mu_i$ and variances $\sigma_i^2$. We denote the value function and the Shapley value by $v$ and $\phi$ respectively, where the superscripts $z$, $p$, and $d$ denote the log-odds (or standard deviation unit), probability, and binary decision outcomes, respectively, the subscript 0 denotes the baseline expectation (expected output), and the subscripts 1 and 2 stand for the features.
Since both explanatory variables $x_1$ and $x_2$ are independent normal variables, the conditional distribution of $z$ given $x_1$ or $x_2$ is
$$z \mid x_1 \sim N\!\left(\beta_0 + \beta_1 x_1 + \beta_2 \mu_2,\; \beta_2^2 \sigma_2^2\right), \qquad z \mid x_2 \sim N\!\left(\beta_0 + \beta_1 \mu_1 + \beta_2 x_2,\; \beta_1^2 \sigma_1^2\right).$$
The above expressions illustrate that conditioning on a feature changes the mean of $z$ and reduces its variance (a reduction in uncertainty over $z$).
In Appendix A, we have derived expressions for the value function and the Shapley value. For logit regression, the value function for the probability outcome requires the expectation of the logistic function (also known as the sigmoid function; we denote it by $S$) over a normal distribution, i.e., $E[S(z)]$ where $z$ is normally distributed, but this expectation does not have a closed-form solution. Thus, we use the standard normal distribution approximation for the logistic/sigmoid function given by the expression below [tocher][dombi2018approximations].
$$S(z) \approx \Phi(\lambda z), \qquad \lambda = \sqrt{\pi/8},$$
where $\Phi$ is the standard normal distribution function.
Figure 1 depicts the binary and probability outcomes as a function of $z$. It also shows that the standard normal approximation of the logistic function is close to the actual logistic function.
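As a quick numerical illustration of how close a normal approximation of the logistic function can be, the sketch below measures the largest gap on a grid. The scale constant $\sqrt{\pi/8}$ is an assumption here (it is the constant implied by Tocher's approximation); the constant used in the paper cannot be confirmed from this text.

```python
import math
from statistics import NormalDist

# Assumed scale constant: logistic(z) ~ Phi(LAMBDA * z)
LAMBDA = math.sqrt(math.pi / 8)
STD_NORMAL = NormalDist()


def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_approx(z):
    """Normal-distribution approximation of the logistic function."""
    return STD_NORMAL.cdf(LAMBDA * z)


# Largest absolute gap between the two curves on a grid over [-8, 8]
grid = [i / 100 for i in range(-800, 801)]
max_err = max(abs(logistic(z) - normal_approx(z)) for z in grid)
print(f"max |logistic - approximation| on the grid: {max_err:.4f}")
```

The gap stays below two percentage points over the whole grid, which is why a closed form derived for the probit case can be reused for logit after rescaling.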
When both features are relevant (slope coefficients are nonzero), we can transform/normalize the data in such a way that the features have mean 0 and regression coefficients equal to 1, with a nonnegative intercept. Thus, without loss of generality, we assume $\mu_1 = \mu_2 = 0$, $\beta_1 = \beta_2 = 1$ and $\beta_0 \geq 0$. We also assume . Under these conditions and (for comparison), Shapley value expressions derived in Appendix A are shown below
Shapley value for $z$:
$$\phi_0^z = \beta_0, \qquad \phi_1^z = x_1, \qquad \phi_2^z = x_2 \qquad \text{(S1)}$$
The baseline Shapley value for $z$ (log-odds or standard deviation units) is equal to the intercept $\beta_0$, whereas the Shapley value of a particular feature is linear in its value and does not depend on the value of the other feature. Since we have linearly transformed each variable (via $x_i \mapsto \beta_i (x_i - \mu_i)$) to have mean 0 and slope coefficient 1, the Shapley value of a feature has the same sign as its original slope coefficient whenever the original feature value is above its mean. The Shapley value of a feature increases in magnitude as we move away from its mean. Figure 2 illustrates the Shapley value of feature 1 as a function of $x_1$ and $x_2$ for the log-odds output. We can observe that the level curves for $\phi_1^z$ are linear in $x_1$ and do not depend on $x_2$.
Shapley value for Probability ($p$):
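$$\phi_0^p = \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right)$$
$$\phi_1^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda(\beta_0 + x_1)}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right) + \Phi\!\left(\lambda(\beta_0 + x_1 + x_2)\right) - \Phi\!\left(\frac{\lambda(\beta_0 + x_2)}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right)\right]$$
$$\phi_2^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda(\beta_0 + x_2)}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right) + \Phi\!\left(\lambda(\beta_0 + x_1 + x_2)\right) - \Phi\!\left(\frac{\lambda(\beta_0 + x_1)}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right)\right] \qquad \text{(S2)}$$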
In (S2), $\lambda$ equals 1 in the case of the probit model ($F = \Phi$) and $\sqrt{\pi/8}$ in the case of the logit model ($F = S$). It is not reasonable to directly compare logit and probit models with the same parameters $\beta$, since from the above approximation we can infer that the parameters of the logit model will be approximately scaled by $1/\lambda$ compared to the probit model parameters. Thus, we avoid comparing logit and probit model results.
The above equations demonstrate that the Shapley value of a feature for the probability outcome depends on all feature values and on the variances. Intuitively, variance matters for a value function that is a conditional expectation of probability over $z$. The spread of observations around the mean depends on the variance, and the absolute changes in probability due to an increase or a decrease in $z$ by the same constant are not equal. Hence, positive and negative deviations from the mean do not influence the value function equally, which implies that the value function (and in turn the Shapley value) depends on the variance. For example, when the variance is close to 0, the value function equals $F(E[z])$, but for a high variance and positive $E[z]$, a positive deviation from the mean has less influence than a negative deviation, implying a reduction in the value function. Figure 3 illustrates the Shapley value of feature 1 ($\phi_1^p$) as a function of $x_1$ and $x_2$ in the case of the probit model. It indicates that the level curves for $\phi_1^p$ are nonlinear in both $x_1$ and $x_2$. The Shapley value of feature 1 always increases with an increase in the value of feature 1, but it can move in either direction if we increase the value of feature 2. This can also be inferred from the above expression.
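The dependence on the other feature and on the variances can be checked numerically. The sketch below assumes the probit case with normalized features (zero means, unit coefficients) and relies on the standard identity $E[\Phi(a + X)] = \Phi(a/\sqrt{1+s^2})$ for $X \sim N(0, s^2)$; all function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Phi = STD_NORMAL.cdf


def expected_probit_closed(a, s):
    """Closed form of E[Phi(a + X)] for X ~ N(0, s^2)."""
    return Phi(a / sqrt(1.0 + s * s))


def expected_probit_numeric(a, s, half_width=8.0, steps=4000):
    """Same expectation via the trapezoid rule over a standard normal variable."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        u = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * Phi(a + s * u) * STD_NORMAL.pdf(u)
    return total * h


def shapley_probability(x1, x2, b0, s1, s2):
    """Baseline and Shapley values for the probit probability outcome.

    Assumes z = b0 + x1 + x2 with independent x_i ~ N(0, s_i^2).
    """
    v0 = expected_probit_closed(b0, sqrt(s1 * s1 + s2 * s2))
    v1 = expected_probit_closed(b0 + x1, s2)  # x2 is still uncertain
    v2 = expected_probit_closed(b0 + x2, s1)  # x1 is still uncertain
    v12 = Phi(b0 + x1 + x2)
    phi1 = 0.5 * ((v1 - v0) + (v12 - v2))
    phi2 = 0.5 * ((v2 - v0) + (v12 - v1))
    return v0, phi1, phi2


# Same x1, different x2: the Shapley value of feature 1 changes
print(shapley_probability(0.5, 1.0, 0.0, 1.0, 1.0)[1])
print(shapley_probability(0.5, -1.0, 0.0, 1.0, 1.0)[1])
```

Only the term $v_{12} - v_2$ involves $x_2$ in the Shapley value of feature 1, which is why that value can move in either direction as $x_2$ changes.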
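Shapley value for Binary Decision ($z \geq 0$ or $p \geq 0.5$):
$$\phi_0^d = \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right)$$
$$\phi_1^d = \frac{1}{2}\left[\Phi\!\left(\frac{\beta_0 + x_1}{\sigma_2}\right) - \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right) + \mathbb{1}\left[\beta_0 + x_1 + x_2 \geq 0\right] - \Phi\!\left(\frac{\beta_0 + x_2}{\sigma_1}\right)\right]$$
$$\phi_2^d = \frac{1}{2}\left[\Phi\!\left(\frac{\beta_0 + x_2}{\sigma_1}\right) - \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right) + \mathbb{1}\left[\beta_0 + x_1 + x_2 \geq 0\right] - \Phi\!\left(\frac{\beta_0 + x_1}{\sigma_2}\right)\right] \qquad \text{(S3)}$$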
Like the probability outcome, the Shapley value for the binary decision outcome depends on the other feature values and on the variances. The Shapley value for the binary decision is discontinuous at $z = 0$ due to the presence of a binary indicator variable for the decision. Thus, the Shapley values for a positive decision and a negative decision can be significantly different even for similar feature values that are close to the cutoff. In other words, a minor change in the sample can lead to a significant difference in the Shapley value. Figure 4 highlights that the level curves for $\phi_1^d$ are discontinuous at the decision boundary.
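The jump at the decision boundary can be illustrated with a short sketch. It assumes normalized features, a decision threshold of $z = 0$, and value functions of the form $P(z \geq 0 \mid x_S)$; the parameter values are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf


def shapley_decision(x1, x2, b0, s1, s2):
    """Baseline and Shapley values for the binary decision 1[z >= 0].

    Assumes z = b0 + x1 + x2 with independent x_i ~ N(0, s_i^2).
    """
    v0 = Phi(b0 / sqrt(s1 * s1 + s2 * s2))  # unconditional acceptance rate
    v1 = Phi((b0 + x1) / s2)                # P(z >= 0 | x1)
    v2 = Phi((b0 + x2) / s1)                # P(z >= 0 | x2)
    v12 = 1.0 if b0 + x1 + x2 >= 0 else 0.0
    phi1 = 0.5 * ((v1 - v0) + (v12 - v2))
    phi2 = 0.5 * ((v2 - v0) + (v12 - v1))
    return v0, phi1, phi2


# Two nearly identical samples on either side of the boundary b0 + x1 + x2 = 0
b0, x2, eps = 0.5, -1.0, 1e-6
accepted = shapley_decision(0.5 + eps, x2, b0, 1.0, 1.0)
rejected = shapley_decision(0.5 - eps, x2, b0, 1.0, 1.0)
jump = accepted[1] - rejected[1]
print(f"jump in the Shapley value of feature 1 across the boundary: {jump:.4f}")
```

Because the indicator contributes with weight one half to each feature, crossing the boundary shifts each Shapley value by roughly 0.5 regardless of how small the change in the sample is.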
4 Disagreements
In the last section, we observed that the Shapley values for different outcomes have different expressions and level curves. For example, the level curves for the $z$ outcome are vertical lines, for probability they are nonlinear, and for the binary decision they are nonlinear and discontinuous. Due to these differences, the Shapley values for different outcomes are expected to show a few disagreements. In this section we focus on the major disagreements described below:

Disagreement in baseline expectation: The baseline expectation for an outcome is equal to $\phi_0 = E[f(x)]$. Since the sum of the Shapley values of all features equals the prediction over and above this average value, a disagreement in baseline expectation implies different reference points for the Shapley values generated using different outcomes, which makes them incomparable with each other.

Disagreement in sign of Shapley value: A disagreement in the sign of a Shapley value across outcomes indicates that the same feature contributes positively to one outcome and negatively to the other. For example, feature 1 may contribute positively to the probability outcome and negatively to the binary decision. Given that both probability and binary decision are monotonic transformations of $z$, this disagreement between signs is counterintuitive, especially when there is no disagreement in baseline expectation.

Disagreement in most important feature: The feature with the highest absolute Shapley value can be interpreted as the most important feature, and knowing it plays an important role in understanding a model prediction. A disagreement about the most important feature for different outcomes of the same model indicates that a feature can be most important for one outcome but not for the other.
These disagreements are discussed below in detail with the help of plots to highlight the region of disagreement.
4.1 Disagreement in Baseline Expectation
The baseline value for the Shapley explanation is $\phi_0^z$, $\phi_0^p$ or $\phi_0^d$, i.e., the unconditional expected value of the outcome. The baseline expectations $\phi_0^p$ and $\phi_0^d$ are measured in probability, but $\phi_0^z$ is measured in log-odds/standard-deviation units. Thus, it is not wise to directly compare the baseline expectations of the three outcomes. Instead, we transform $\phi_0^z$ using $F$ such that its baseline expectation is comparable with the other outcomes; here $F$ is the standard normal distribution function $\Phi$ in the case of probit, and the logistic function $S$ or its approximation $\Phi(\lambda z)$ in the case of logit regression. If there is no disagreement in baseline expectation, we have
$$F(\phi_0^z) = \phi_0^p = \phi_0^d, \quad \text{i.e.,} \quad \Phi(\lambda \beta_0) = \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right) = \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right),$$
where $\lambda$ equals 1 in the case of the probit model ($F = \Phi$) and $\sqrt{\pi/8}$ for the logit model ($F = S$).
For $\beta_0 = 0$, the baseline expectation for $z$ is 0 and the baseline expectation for probability/binary decision is 0.5, implying that all three baseline expectations are aligned, i.e.,
$$F(\phi_0^z) = \phi_0^p = \phi_0^d = 0.5.$$
For $\beta_0 > 0$, the baseline expectations ($\phi_0$ or $v_0$) for the different outcomes are not aligned. Specifically, for $\sigma_1^2 + \sigma_2^2 < 1/\lambda^2$ (the other case only affects the right-most inequality)
$$\phi_0^p < F(\phi_0^z) < \phi_0^d.$$
Intuitively, the difference between the baseline of the log-odds/standard-deviation unit ($z$) and that of probability arises because the function $F$ is concave for $z > 0$, convex for $z < 0$, and has point symmetry ($180^\circ$ rotation symmetry). For a concave function, the expected value lies below the function evaluated at the expected value, and the converse is true for a convex function. Because we have taken the underlying distribution to be symmetric, when $\beta_0 > 0$ most of the observations lie in the concave region and hence the expected probability is less than the probability at the expected $z$ (log-odds or standard deviation unit). When $\beta_0 = 0$, observations are symmetrically distributed across the concave and convex regions, so the expected probability equals the probability at the expected $z$.
In the real world, we commonly have data with unbalanced classes, and observing a class with probability more than 0.95 is not rare, which implies a high baseline expectation $\phi_0^p$ or $\phi_0^d$. Thus, we could easily observe a significant difference between $F(\phi_0^z)$, $\phi_0^p$ and $\phi_0^d$, and this difference increases as the probability of the common class increases.
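A small numerical sketch makes the three reference points concrete. It assumes the probit case by default (scale constant 1), normalized features, and a decision threshold at $z = 0$; the function name and the parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf


def baselines(b0, s1, s2, lam=1.0):
    """Baseline expectations on the probability scale.

    Returns (F at the expected z, expected probability, expected decision)
    for z = b0 + x1 + x2 with independent x_i ~ N(0, s_i^2).
    """
    total_var = s1 * s1 + s2 * s2
    at_expected_z = Phi(lam * b0)
    expected_prob = Phi(lam * b0 / sqrt(1.0 + lam * lam * total_var))
    expected_decision = Phi(b0 / sqrt(total_var))
    return at_expected_z, expected_prob, expected_decision


aligned = baselines(0.0, 0.6, 0.4)   # zero intercept: all three equal 0.5
low_var = baselines(1.0, 0.6, 0.4)   # total variance below 1
high_var = baselines(1.0, 2.0, 1.0)  # total variance above 1
for name, row in (("aligned", aligned), ("low_var", low_var), ("high_var", high_var)):
    print(name, [round(value, 4) for value in row])
```

With a positive intercept the expected probability always sits below the probability at the expected $z$, while the position of the expected decision relative to the latter flips once the total variance exceeds one in the probit case.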
4.2 Disagreement in Sign
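There is no closed-form solution for the points at which the Shapley value of a feature equals zero ($\phi_i = 0$) in the case of the probability and binary decision outcomes. First, we argue mathematically that there exists a point around the mean of the explanatory variable where there is a disagreement in sign between the Shapley value for $z$ and that for probability/binary decision when $\beta_0 > 0$. Later, we graphically highlight the region of disagreement even for the case $\beta_0 = 0$ (no baseline difference).
Shapley value for $z$ at $x_1 = x_2 = 0$:
$$\phi_1^z = \phi_2^z = 0$$
Shapley value for $p$ at $x_1 = x_2 = 0$:
$$\phi_1^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right) + \Phi(\lambda \beta_0) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right)\right] > 0$$
$$\phi_2^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right) + \Phi(\lambda \beta_0) - \Phi\!\left(\frac{\lambda \beta_0}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right)\right] > 0$$
Shapley value for Binary Decision at $x_1 = x_2 = 0$:
$$\phi_1^d = \frac{1}{2}\left[\Phi\!\left(\frac{\beta_0}{\sigma_2}\right) - \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right) + 1 - \Phi\!\left(\frac{\beta_0}{\sigma_1}\right)\right] > 0$$
$$\phi_2^d = \frac{1}{2}\left[\Phi\!\left(\frac{\beta_0}{\sigma_1}\right) - \Phi\!\left(\frac{\beta_0}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right) + 1 - \Phi\!\left(\frac{\beta_0}{\sigma_2}\right)\right] > 0$$
The above expressions give the Shapley explanation for both features at the mean, $x_1 = x_2 = 0$, with $\beta_0 > 0$. The Shapley values for probability and binary decision are positive for both features, whereas the Shapley explanation for $z$ (log-odds/standard-deviation unit) is 0 for both features. Intuitively, this happens because there are two effects: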

Change in Expected $z$ (Log-Odds or Standard Deviation Unit): This effect accounts for the contribution of a variable that arises from the change in the expected value of $z$. It is present in the Shapley explanation for all outcomes. At the mean of a feature, however, this effect is 0, as conditioning on the feature does not change the mean of the conditional distribution. This is the only effect present in the Shapley value for $z$. Thus, at the mean, the Shapley value for the $z$ outcome is 0 for both features.

Reduction in Uncertainty of $z$ (Log-Odds or Standard Deviation Unit): From the earlier discussion of the value function expressions, we noticed that the value function for probability and binary decision also depends on the variance. A reduction in the variance positively affects the value function (for $\beta_0 > 0$). When we condition on a variable, the variance is reduced and the value function increases. For this reason we have a positive Shapley value for probability and binary decision.
The Shapley explanation for a feature is monotonic in its value and continuous around the mean for $\beta_0 > 0$ (footnote 4: the Shapley value for the binary decision is continuous around the mean for $\beta_0 > 0$ because the decision does not change around the mean, and a change of decision is the only source of discontinuity). Hence, there exists a point around the mean at which the Shapley explanation in log-odds has the opposite sign to the Shapley values for probability and binary decision, because the latter are strictly positive at the mean. Intuitively, at this point the reduction-in-uncertainty effect dominates the change-in-expected-$z$ effect.
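The existence of such a point can be checked directly. The sketch below assumes the probit case with normalized features and a threshold at $z = 0$, and evaluates the Shapley value of feature 1 for all three outcomes at the mean and at a point slightly below the mean; the parameter values are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf


def shapley_feature1(x1, x2, b0, s1, s2):
    """Shapley value of feature 1 for the z, probability and decision outcomes.

    Assumes a probit model with z = b0 + x1 + x2, x_i ~ N(0, s_i^2) independent.
    """
    total_sd = sqrt(s1 * s1 + s2 * s2)
    phi_z = x1
    phi_p = 0.5 * (
        Phi((b0 + x1) / sqrt(1.0 + s2 * s2)) - Phi(b0 / sqrt(1.0 + total_sd ** 2))
        + Phi(b0 + x1 + x2) - Phi((b0 + x2) / sqrt(1.0 + s1 * s1))
    )
    phi_d = 0.5 * (
        Phi((b0 + x1) / s2) - Phi(b0 / total_sd)
        + (1.0 if b0 + x1 + x2 >= 0 else 0.0) - Phi((b0 + x2) / s1)
    )
    return phi_z, phi_p, phi_d


at_mean = shapley_feature1(0.0, 0.0, 1.0, 1.0, 0.5)
below_mean = shapley_feature1(-0.05, 0.0, 1.0, 1.0, 0.5)
print("at mean:   ", [round(value, 4) for value in at_mean])
print("below mean:", [round(value, 4) for value in below_mean])
```

At the point slightly below the mean, the log-odds Shapley value is already negative while the probability and decision values are still positive, which is the sign disagreement described in the text.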
Figures 5 and 6 illustrate the level curves of zero Shapley value for feature 1 under the different outcomes (footnote 5: the Shapley value for the binary decision is not exactly 0 at points on the decision boundary, $x_1 + x_2 + \beta_0 = 0$, but the boundary separates the regions of positive and negative Shapley value). The Shapley value of feature 1 is positive for all points to the right of the level curve and negative to the left. The figures also show the region where the Shapley explanation of feature 1 is positive for one outcome and negative for another, even when there is no baseline difference. For example, any point in the region to the right of the green line and to the left of the orange line has a positive Shapley value for probability but a negative one for the binary decision. Visually, the region of sign disagreement is larger for $\beta_0$ equal to 1.
4.3 Disagreement in Most Important Feature
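We refer to the feature with the highest absolute Shapley value as the most important feature. To avoid differences in explanation that are due to differences in baseline expectation, we start with $\beta_0$ equal to 0, but the results are valid irrespective of the value of $\beta_0$. Under this condition, all baseline expectations are aligned ($F(\phi_0^z) = \phi_0^p = \phi_0^d = 0.5$). The example below evaluates the Shapley values at a point with equal feature values, $x_1 = x_2 = c$ with $c > 0$, and illustrates a disagreement on the most important feature across outcomes even when they agree on sign and there is no baseline difference.
Shapley value for $z$:
$$\phi_1^z = \phi_2^z = c$$
Shapley value for Probability:
$$\phi_1^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda c}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right) - \frac{1}{2} + \Phi(2\lambda c) - \Phi\!\left(\frac{\lambda c}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right)\right], \qquad \phi_2^p = \frac{1}{2}\left[\Phi\!\left(\frac{\lambda c}{\sqrt{1 + \lambda^2 \sigma_1^2}}\right) - \frac{1}{2} + \Phi(2\lambda c) - \Phi\!\left(\frac{\lambda c}{\sqrt{1 + \lambda^2 \sigma_2^2}}\right)\right]$$
Shapley value for Binary Decision:
$$\phi_1^d = \frac{1}{2}\left[\frac{1}{2} + \Phi\!\left(\frac{c}{\sigma_2}\right) - \Phi\!\left(\frac{c}{\sigma_1}\right)\right], \qquad \phi_2^d = \frac{1}{2}\left[\frac{1}{2} + \Phi\!\left(\frac{c}{\sigma_1}\right) - \Phi\!\left(\frac{c}{\sigma_2}\right)\right]$$
The above expressions demonstrate that the Shapley values of both features for $z$ are equal. The Shapley values for probability and binary decision, however, assign a higher value to the feature with the higher variance (feature 1, taking $\sigma_1 > \sigma_2$) even when all other parameters are the same. This difference between the features' Shapley values increases with the difference in their variances and with the point $c$. Since the difference between the features' Shapley values is continuous and monotonic in the feature value for all three outcomes, there exists a point at which the most important feature for $z$ disagrees with that for probability/binary decision. Later we highlight the region of disagreement on the most important feature for the other outcomes.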
It seems counterintuitive that the Shapley values for probability/binary decision could differ between the two features even when their means and regression coefficients are the same and $x_1 = x_2$. Here the variance of the features plays a crucial role. We can understand this with an example. Consider a credit application model with a regression intercept of 0, regression slope coefficients of 1, and feature means of 0. Let the bank accept an application if the predicted probability is more than 0.5, equivalently 0 in $z$, and take $x_1 = x_2 = 0.1$. Further assume that $\sigma_1$ is high but $\sigma_2$ is low, such that feature 2 cannot affect $z$ by more than $0.1$ with 95% confidence. Although both $x_1$ and $x_2$ are 0.1, they are not equally important for the decision. We explain this with a hypothetical scenario in which we know only one value. If we know $x_1 = 0.1$, then we are almost certain that the application will be accepted, as it is highly unlikely that $x_2$ reduces the probability below the acceptance threshold. In the opposite case, where we know $x_2 = 0.1$, we are not confident that the application will be accepted, because the high variance of feature 1 may decrease the probability below the acceptance threshold. This example shows the role of variance in feature importance.
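The credit example can be put into numbers with a short sketch. The standard deviations below (1.0 for feature 1 and 0.05 for feature 2) are hypothetical choices that satisfy the description, and the value functions assume normalized features with a decision threshold at $z = 0$.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf


def shapley_decision(x1, x2, b0, s1, s2):
    """Shapley values of both features for the accept/reject decision 1[z >= 0]."""
    v0 = Phi(b0 / sqrt(s1 * s1 + s2 * s2))  # acceptance rate with no information
    v1 = Phi((b0 + x1) / s2)                # acceptance probability knowing only x1
    v2 = Phi((b0 + x2) / s1)                # acceptance probability knowing only x2
    v12 = 1.0 if b0 + x1 + x2 >= 0 else 0.0
    phi1 = 0.5 * ((v1 - v0) + (v12 - v2))
    phi2 = 0.5 * ((v2 - v0) + (v12 - v1))
    return phi1, phi2, v1, v2


# Hypothetical numbers: equal feature values, very different variances
phi1, phi2, know_x1, know_x2 = shapley_decision(0.1, 0.1, 0.0, s1=1.0, s2=0.05)
print(f"P(accept | x1 only) = {know_x1:.3f}, P(accept | x2 only) = {know_x2:.3f}")
print(f"phi1 = {phi1:.3f}, phi2 = {phi2:.3f}")
```

Knowing the high-variance feature resolves almost all of the uncertainty about the decision, so it receives most of the credit even though both features have the same value, mean and coefficient.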
To identify the region of disagreement on the most important feature, we start by estimating the curves along which the absolute Shapley values of both features are equal. There are two cases in which both features can have equal absolute Shapley values: first, when the Shapley values of both features have the same sign ($\phi_1 = \phi_2$), and second, when they have opposite signs ($\phi_1 = -\phi_2$). The equations below show that the equal-importance curves for both cases (same sign and opposite sign) are linear in the feature values. The solutions of $\phi_1 = \phi_2$ for the different outcomes intersect at $(-\beta_0, -\beta_0)$; thus the difference between the same-sign equal-importance curves arises from the slope, which does not depend on $\beta_0$. The solutions of $\phi_1 = -\phi_2$ for the different outcomes are parallel to each other, implying that the difference between the opposite-sign equal-importance curves arises from the intercept, and when $\beta_0$ equals 0 the lines for all outcomes are the same.

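For $z$:

same sign: $x_2 = x_1$

opposite sign: $x_2 = -x_1$


For probability:

same sign: $x_2 + \beta_0 = \sqrt{\dfrac{1 + \lambda^2 \sigma_1^2}{1 + \lambda^2 \sigma_2^2}}\,(x_1 + \beta_0)$

opposite sign: $x_2 = -x_1 - \beta_0\left(1 - \dfrac{1}{\sqrt{1 + \lambda^2(\sigma_1^2 + \sigma_2^2)}}\right)$


For Binary Decision:

same sign: $x_2 + \beta_0 = \dfrac{\sigma_1}{\sigma_2}\,(x_1 + \beta_0)$

opposite sign (footnote 6: for the binary decision outcome, the sum of the Shapley values is never equal to 0, but this sum has a different sign for points above and below the line $x_1 + x_2 + \beta_0 = 0$. Thus we consider this line as the solution of $\phi_1 = -\phi_2$. Note that this simplification does not affect the disagreement results.): $x_2 = -x_1 - \beta_0$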

In Figures 7 and 8, we indicate the same-sign equal-importance curves with solid lines and the opposite-sign equal-importance curves with dashed lines. For $\beta_0 = 0$, the opposite-sign equal-importance curves are exactly the same for all outcomes, so the dashed lines overlap each other. The vertically bounded area between the solid and dashed lines (below the solid line and above the dashed line, or above the solid line and below the dashed line) for a particular outcome is a region where feature 1 is more important than feature 2 (footnote 7: this holds because the Shapley value of both features is 0 at the intersection of the dashed and solid lines, and the partial derivative of the Shapley value of feature $i$ w.r.t. $x_i$ is positive and greater than or equal to the partial derivative of the other feature's Shapley value w.r.t. $x_i$, for any feature $i$). Since the slopes of the equal-importance lines differ across outcomes, there exists a region with disagreement on the most important feature. For example, the region bounded between the red and blue solid lines indicates a disagreement on the most important feature between $z$ and the binary decision. For $\beta_0 > 0$, the dashed lines are parallel to each other, as shown in Figure 8. Visually, the region of disagreement grows as the distance between the dashed lines increases.
5 Disagreements with Asymptotic Variance
In the last section, we discussed disagreements between the Shapley values of different outcomes and how they depend on the value of $\beta_0$ and the variance of the variables. When $\beta_0 = 0$, a few disagreements disappear or weaken. In this section we focus on how the disagreements behave when the variance is high or low. When the overall variance is low, most of the observations lie in a small region, implying that the probability curve can be approximated by a straight line in that region. When the overall variance is high, there are few observations around 0 (where the derivative is high), making the binary decision curve a better candidate for approximating the probability curve. Hence, for low variance we expect the Shapley values for probability and $z$ to be similar, and for high variance we expect the Shapley values for the probability and binary decision outcomes to be similar.
Although we have normalized the means of the features to 0 and their regression coefficients to 1, in this section we write the equations using the means and regression coefficients to make them explicit. For the probability outcome we use logit regression, but the same results hold for the probit model.
Figure 9 illustrates the baseline expectation for the different outcomes when the mean of $z$ is positive ($E[z] > 0$). For low variance (footnote 8: the overall model variance can be measured by the variance of $z$, which is $\beta_1^2 \sigma_1^2 + \beta_2^2 \sigma_2^2$), the baseline expectation for the probability outcome is close to $F(\phi_0^z)$, and for high variance the baseline expectation for the probability outcome is close to that of the binary decision.
To measure the scale of the disagreement on sign and on the most important feature, we simulated 1 million sample points for different expected values and variances of the log-odds. Tables 1 and 2 report the percentage of samples with a sign disagreement and the percentage of samples with a disagreement on the most important feature, respectively (footnote 9: when the expected value of $z$ is high and the variance is low, the binary decision outcome is always positive, i.e., the model outcome for the binary decision is constant, implying a Shapley value of 0 for all features. Hence, the table is left blank for the binary decision with parameters $\beta_0 = 1$, $\sigma_1 = 0.02$, $\sigma_2 = 0.01$; footnote 10: For table 7 and 8 we take . Results can differ slightly for different parameters, but the conclusions are general). In the case of low variance ($\sigma_1 = 0.02$ and $\sigma_2 = 0.01$), there is almost no disagreement between the log-odds and probability outcomes, and in the case of high variance ($\sigma_1 = 200$ and $\sigma_2 = 100$) there is almost no disagreement between the probability and binary decision outcomes. Both tables also show that the disagreement percentage increases with the expected value of $z$ (leaving aside the very-high-variance case). More details on the disagreements in the high/low variance cases are available in Appendix B.
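Table 1: Sign Disagreement (percentage of samples; $z$ = log-odds, $p$ = probability, $d$ = binary decision; "-" marks a cell left blank)
$\beta_0$  $\sigma_1$  $\sigma_2$  $z$ vs $p$  $p$ vs $d$  $z$ vs $d$
0  0.02  0.01  0.00%  21.63%  21.63%
0  2  1  6.45%  17.51%  21.63%
0  200  100  21.34%  0.30%  21.63%
1  0.02  0.01  0.23%  -  -
1  2  1  10.71%  16.81%  23.94%
1  200  100  21.33%  0.30%  21.63%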
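Table 2: Important Feature Disagreement (percentage of samples; $z$ = log-odds, $p$ = probability, $d$ = binary decision; "-" marks a cell left blank)
$\beta_0$  $\sigma_1$  $\sigma_2$  $z$ vs $p$  $p$ vs $d$  $z$ vs $d$
0  0.02  0.01  0.00%  6.94%  6.95%
0  2  1  3.53%  3.42%  6.95%
0  200  100  6.95%  0.00%  6.95%
1  0.02  0.01  0.11%  -  -
1  2  1  5.53%  5.24%  10.77%
1  200  100  6.94%  0.00%  6.94%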
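A simulation of this kind can be sketched as follows. This is not the authors' code: it assumes normalized features, a logit model handled through a normal approximation with scale constant $\sqrt{\pi/8}$, that the two spread parameters are standard deviations, and that disagreement is measured on feature 1. It uses 20,000 draws instead of 1 million, so the numbers it prints are not expected to reproduce the tables exactly.

```python
import random
from math import pi, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
LAM = sqrt(pi / 8)  # assumed scale constant of the normal approximation


def shapley_feature1(x1, x2, b0, s1, s2):
    """Shapley value of feature 1 for the log-odds, probability and decision outcomes."""
    var = s1 * s1 + s2 * s2
    phi_z = x1
    phi_p = 0.5 * (
        Phi(LAM * (b0 + x1) / sqrt(1 + LAM**2 * s2 * s2))
        - Phi(LAM * b0 / sqrt(1 + LAM**2 * var))
        + Phi(LAM * (b0 + x1 + x2))
        - Phi(LAM * (b0 + x2) / sqrt(1 + LAM**2 * s1 * s1))
    )
    phi_d = 0.5 * (
        Phi((b0 + x1) / s2) - Phi(b0 / sqrt(var))
        + (1.0 if b0 + x1 + x2 >= 0 else 0.0) - Phi((b0 + x2) / s1)
    )
    return phi_z, phi_p, phi_d


def sign_disagreement(b0, s1, s2, n=20000, seed=0):
    """Share of simulated samples whose feature-1 Shapley values differ in sign."""
    rng = random.Random(seed)
    counts = {"z_vs_p": 0, "p_vs_d": 0, "z_vs_d": 0}
    for _ in range(n):
        z, p, d = shapley_feature1(rng.gauss(0, s1), rng.gauss(0, s2), b0, s1, s2)
        counts["z_vs_p"] += (z > 0) != (p > 0)
        counts["p_vs_d"] += (p > 0) != (d > 0)
        counts["z_vs_d"] += (z > 0) != (d > 0)
    return {pair: count / n for pair, count in counts.items()}


low = sign_disagreement(0.0, 0.02, 0.01)
high = sign_disagreement(0.0, 200.0, 100.0)
print("low variance: ", low)
print("high variance:", high)
```

The qualitative pattern is what matters here: at low variance the log-odds and probability explanations almost never disagree in sign, and at high variance the probability and binary decision explanations almost never do.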
6 Global Feature Importance
Global feature importance for a model gives a numerical estimate of the importance of a feature at the global level. It is commonly used to compare the global relevance of features and to understand which features are more relevant than others. Global feature importance is calculated by taking the sum of the absolute Shapley values of the feature over all samples, i.e.,
$$I_i = \sum_{j=1}^{n} \left|\phi_i^{(j)}\right|,$$
where $I_i$ denotes the global importance of feature $i$ and $\phi_i^{(j)}$ denotes the Shapley value of feature $i$ for sample $j$. The global importance of feature $i$ for outcome $z$ (denoted by $I_i^z$) is given below (footnote 11: F. C. Leone [leone1961folded] has shown that $E|x| = \sigma\sqrt{2/\pi}$ for $x \sim N(0, \sigma^2)$. This implies $I_i^z \approx n\,\sigma_i\sqrt{2/\pi}$).
$$I_i^z = \sum_{j=1}^{n} \left|x_i^{(j)}\right| \approx n\,\sigma_i \sqrt{\frac{2}{\pi}}$$
Since the global importance of a feature is a scalar multiple of $\sigma_i$, the relative global importance of the features for $z$ depends on their relative $\sigma_i$.
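The folded-normal result can be checked with a quick simulation. The sketch assumes normalized features, so that the log-odds Shapley value of a feature is the feature value itself, and it reports the mean absolute Shapley value (the sum divided by the number of samples); the function name and sample size are illustrative.

```python
import random
from math import pi, sqrt


def mean_abs_shapley_z(sigma, n=50000, seed=0):
    """Mean absolute log-odds Shapley value of a feature x ~ N(0, sigma^2).

    With zero mean and unit coefficient (assumed), the Shapley value is x itself.
    """
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0.0, sigma)) for _ in range(n)) / n


sigma1, sigma2 = 2.0, 1.0
est1 = mean_abs_shapley_z(sigma1, seed=1)
est2 = mean_abs_shapley_z(sigma2, seed=2)
closed1 = sigma1 * sqrt(2 / pi)  # folded-normal mean for a zero-mean normal

print(f"simulated {est1:.4f} vs closed form {closed1:.4f}")
print(f"relative importance of feature 1: {est1 / est2:.3f}")
```

The ratio of the two estimates approaches the ratio of the standard deviations, which is the sense in which the relative global importance for the log-odds outcome is fixed by the relative spread of the features.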
For the probability and binary decision outcomes, global feature importance does not have a closed-form solution. Thus, we simulated 1 million samples with different expected values and variances of the log-odds. Table 3 reports the relative global importance of feature 1 for all outcomes, and the excess relative global importance of feature 1 for the probability and binary decision outcomes compared to log-odds. The relative global importance of feature 1 for $z$ equals $\sigma_1/\sigma_2$, which is taken as 2 or 5. The table shows that the probability and binary decision outcomes assign a higher relative importance to feature 1 than the log-odds outcome does. The probability outcome exceeds the relative global importance of feature 1 under the log-odds outcome by 28% when $\beta_0 = 0$ and $(\sigma_1, \sigma_2) = (5, 1)$. In the case of low variance, the global feature importances of the log-odds and probability outcomes are equal, and for high variance, the global feature importances of the probability and binary decision outcomes are equal.

Table 3: Relative Feature Importance of feature 1 (columns 4 to 6) and Excess Relative Importance compared to log odds (columns 7 and 8); "-" marks a cell left blank
$\beta_0$  $\sigma_1$  $\sigma_2$  log odds  probability  binary decision  probability (excess)  binary decision (excess)
0  0.02  0.01  2.00  2.00  2.28  0.0%  13.8%
0  2  1  2.00  2.19  2.28  9.5%  13.8%
0  200  100  2.00  2.28  2.28  14.0%  13.8%
1  0.02  0.01  2.00  2.00  -  0.0%  -
1  2  1  2.00  2.18  2.24  9.0%  12.1%
1  200  100  2.00  2.28  2.28  14.0%  13.8%
0  0.05  0.01  5.00  5.00  6.26  0.0%  25.3%
0  5  1  5.00  6.41  6.26  28.3%  25.3%
0  500  100  5.00  6.28  6.26  25.7%  25.3%
1  0.05  0.01  5.00  5.00  -  0.0%  -
1  5  1  5.00  6.39  6.23  27.9%  24.5%
1  500  100  5.00  6.29  6.26  25.7%  25.3%
7 Conclusion
The Shapley value is a method for explaining the contribution of features to a prediction; it has a game-theoretic foundation and satisfies certain desirable properties of fairness. To understand a model prediction, it is essential to understand the contribution of its features, and Shapley values assign that contribution to the features. We have discussed the factors that influence the Shapley value of a linear probability model for different outcomes: probability, log-odds, and binary decision.
Our principal findings are as follows. The Shapley values for the probability and binary decision outcomes depend on the overall variance and on the values of the other features, unlike log-odds, where the Shapley explanation of a feature is a function of its mean, regression coefficient, and feature value only. Moreover, the Shapley value for the binary decision is discontinuous. There are disagreements between the Shapley values for different outcomes. The baseline expectations for the probability, log-odds, and binary decision outcomes differ, implying different reference points, which makes these values incomparable across outcomes of the same model. The sign of the Shapley value of the same feature can differ across outcomes because the relevance of variance differs across outcomes. Even the most important feature can vary across outcomes, so that feature A can hold top importance for the probability outcome but not for the decision-making (accept or reject) outcome. The disagreements between the Shapley values for the probability and log-odds outcomes dissolve if the overall variance is low. When the overall variance is high, there are minimal disagreements between the Shapley values for the probability and binary decision outcomes. Global feature importance for the probability and binary decision outcomes is more influenced by variance than it is for the log-odds outcome.
In credit risk modeling, the same model can be used to predict different outcomes. There is no unique Shapley explanation for a model, since the explanations vary across outcomes with some disagreements. We should therefore estimate the Shapley explanation according to how the model is used. For example, if the model is used for accepting or rejecting a loan application, then the Shapley explanation for the binary decision is more suitable. If we are estimating the probability of default, then the Shapley explanation for the probability outcome is more appropriate. These conclusions are not limited to the linear probability model but are applicable to a broader class of machine learning models.
Appendix A Derivation of Shapley value
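Shapley value for $z$:
Value function for $z$ is:
$$v_0^z = \beta_0 + \beta_1 \mu_1 + \beta_2 \mu_2, \quad v_1^z = \beta_0 + \beta_1 x_1 + \beta_2 \mu_2, \quad v_2^z = \beta_0 + \beta_1 \mu_1 + \beta_2 x_2, \quad v_{12}^z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Now we apply the Shapley value formula to estimate the value for each feature
$$\phi_0^z = v_0^z, \qquad \phi_1^z = \tfrac{1}{2}\left[v_1^z - v_0^z\right] + \tfrac{1}{2}\left[v_{12}^z - v_2^z\right] = \beta_1 (x_1 - \mu_1), \qquad \phi_2^z = \tfrac{1}{2}\left[v_2^z - v_0^z\right] + \tfrac{1}{2}\left[v_{12}^z - v_1^z\right] = \beta_2 (x_2 - \mu_2)$$
Shapley value for Binary Decision ($z \geq 0$ or $p \geq 0.5$):
Value function for the Binary Decision outcome equals
$$v_0^d = \Phi\!\left(\frac{\beta_0 + \beta_1 \mu_1 + \beta_2 \mu_2}{\sqrt{\beta_1^2 \sigma_1^2 + \beta_2^2 \sigma_2^2}}\right), \quad v_1^d = \Phi\!\left(\frac{\beta_0 + \beta_1 x_1 + \beta_2 \mu_2}{|\beta_2|\,\sigma_2}\right), \quad v_2^d = \Phi\!\left(\frac{\beta_0 + \beta_1 \mu_1 + \beta_2 x_2}{|\beta_1|\,\sigma_1}\right), \quad v_{12}^d = \mathbb{1}\left[\beta_0 + \beta_1 x_1 + \beta_2 x_2 \geq 0\right]$$
Now we apply the Shapley value formula to estimate the value for each feature
$$\phi_0^d = v_0^d, \qquad \phi_1^d = \tfrac{1}{2}\left[v_1^d - v_0^d\right] + \tfrac{1}{2}\left[v_{12}^d - v_2^d\right], \qquad \phi_2^d = \tfrac{1}{2}\left[v_2^d - v_0^d\right] + \tfrac{1}{2}\left[v_{12}^d - v_1^d\right]$$
Shapley value for Probability ($p$):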
In order to estimate the value function for the logit/probit model, we need to simplify , where