## 1 Introduction

With machine learning (ML) models being increasingly employed in high-stakes domains such as criminal justice, finance, and healthcare, it is essential to ensure that the relevant stakeholders understand these models' decisions. However, existing approaches to explaining the predictions of complex ML models suffer from several critical shortcomings. Recent works have shown that explanations generated using attribution-based methods are not stable (ghorbani2019interpretation; slack2019can; dombrowski2019explanations; adebayo2018sanity; alvarez2018robustness; bansal2020sam), e.g., infinitesimal perturbations to an input can result in substantially different explanations. Existing metrics (alvarez2018robustness) measure the change in explanation only with respect to input perturbations, i.e., they assume only black-box access to the predictive model and do not leverage potentially meaningful information such as the model's internal representations to evaluate stability. To address these limitations of existing stability metrics, we propose Relative Stability, which measures the change in the output explanation with respect to the behavior of the underlying predictive model (Section 3.3). Finally, we present extensive theoretical and empirical analysis (Section 4.2) comparing the stability of seven state-of-the-art explanation methods on multiple real-world datasets.

## 2 Related Work

This paper draws from two main areas of prior work: 1) attribution-based explanation methods, and 2) stability analysis of explanations.

Attribution-based Explanation Methods. While a variety of approaches have been proposed to explain the decisions of classifiers, our work focuses on *local feature attribution explanations*, which measure the contribution of each feature to the model's prediction on a point. In particular, we study two broad types of feature attribution explanations: gradient-based and approximation-based. Gradient-based feature attribution methods like VanillaGrad (simonyan2013saliency), SmoothGrad (smilkov2017smoothgrad), Integrated Gradients (sundararajan2017axiomatic), and GradientInput (shrikumar2017learning) leverage model gradients to quantify how a change in each feature would affect the model's prediction. Approximation-based methods like LIME (ribeiro2016should), SHAP (lundberg2017unified), Anchors (ribeiro2018anchors), BayesLIME, and BayesSHAP (slack2021reliable) leverage perturbations of individual inputs to construct a local approximation model from which feature attributions are derived.
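As a toy illustration of the gradient-based family, the sketch below computes a GradientInput-style attribution for a scalar-valued model, substituting a finite-difference estimate for the automatic differentiation these methods actually use (all function names here are illustrative, not from any library):

```python
def finite_diff_grad(f, x, h=1e-5):
    # central-difference estimate of the gradient of a scalar model f at x
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def gradient_times_input(f, x):
    # GradientXInput-style attribution: each feature's value scaled by the
    # model's local sensitivity to that feature
    return [xi * gi for xi, gi in zip(x, finite_diff_grad(f, x))]
```

For a linear model $f(x) = 2x_0 + 3x_1$, the attribution at $x = (1, 2)$ is approximately $(2, 6)$: each feature's contribution scales with both its value and the model's sensitivity to it.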

Explanation Stability. Recent works have formalized desirable properties for feature attribution explanations (agarwal2022probing). Our work specifically focuses on the *stability* of explanations. alvarez2018robustness argued that "similar inputs should lead to similar explanations" and were the first to formalize a metric measuring the stability of local explanation methods. We highlight potential issues with this stability metric, which measures stability only w.r.t. the change in the *input*.

## 3 Stability Analysis for Evaluating Explanations

### 3.1 Notation and Preliminaries

Machine Learning Model. Given a feature domain $\mathcal{X} \subseteq \mathbb{R}^d$ and label domain $\mathcal{Y} = \{1, \dots, C\}$, we denote a classification model $m: \mathcal{X} \to \mathcal{Y}$ that maps a $d$-dimensional feature vector $x \in \mathcal{X}$ to a label $y \in \mathcal{Y}$, where $C$ is the total number of classes in the dataset. We use $\mathcal{D}$ to denote all the instances in the dataset. In addition, we define $m = \sigma \circ h$, where $h: \mathcal{X} \to \mathbb{R}^C$ is a scoring function (e.g., logits) and $\sigma: \mathbb{R}^C \to \mathcal{Y}$ is an activation function that maps output logit scores to discrete labels. Finally, for a given input $x$, the output predicted class label is $\hat{y} = m(x) = \sigma(h(x))$. We assume access to the gradients and intermediate representations of model $m$.

Explainability Methods. An attribution-based explanation method $E$ generates an explanation $e_x \in \mathbb{R}^d$ to explain model prediction $m(x)$. To calculate our stability metrics, we generate perturbed instances $x'$ by adding infinitesimal noise to $x$, and denote their respective explanations as $e_{x'}$.

### 3.2 Existing Definition and Problems

alvarez2018robustness formalize the first stability metric for local explanation methods, arguing that explanations should be robust to local perturbations of an input. To evaluate the stability of an explanation $e_x$ for an instance $x$, perturbed instances $x'$ are generated by adding infinitesimally small noise to the clean instance $x$ such that $x' \in \mathcal{N}_x$:

$$\hat{L}(x) = \max_{x' \in \mathcal{N}_x} \frac{\| e_x - e_{x'} \|_2}{\| x - x' \|_2} \qquad (1)$$

where $\mathcal{N}_x$ is a neighborhood of instances similar to $x$, and $e_x$ and $e_{x'}$ denote the explanations corresponding to instances $x$ and $x'$, respectively. For each point $x$, the stability ratio in Equation 1 measures how the output explanation varies with respect to the change in the *input*. Because the neighborhood instances $x'$ are sampled to be similar to the original instance $x$, the authors argue that similar points should have similar model explanations, i.e., we desire the ratio in Equation 1 to be close to zero (alvarez2021from). This stability definition relies on the point-wise, neighborhood-based local Lipschitz continuity of the explanation method around $x$.

Problems. We note two key problems with the existing stability definition: i) it assumes only black-box access to the prediction model $m$, and does not leverage potentially meaningful information such as the model's internal representations for evaluating stability; and ii) it implicitly assumes that $m$ has the same *behavior* on similar inputs $x$ and $x'$. While this may be the case for underlying prediction models that are smooth or robust, this assumption can fail in a large number of cases. In Figure 1, we discuss a toy example where perturbed samples $x'$ have drastically different intermediate representations than the original point $x$.
Since the goal of an explanation is to faithfully and accurately represent the behavior of the underlying prediction model (agarwal2022probing), we argue that an explanation method *should* vary for points $x$ and $x'$ where the prediction model's behavior differs. Thus, we argue for new stability metrics that measure how much explanations vary with respect to the behavior of the underlying prediction model.

### 3.3 Proposed metric: Relative Stability

To address the aforementioned challenges, we propose Relative Stability, which leverages model information to evaluate the stability of an explanation with respect to the change in the a) input data, b) intermediate representations, and c) output logits of the underlying prediction model.

a) Relative Input Stability. We extend the stability metric in Equation 1 and define Relative Input Stability (RIS), which measures the relative distance between explanations $e_x$ and $e_{x'}$ with respect to the distance between the two inputs $x$ and $x'$:

$$\mathrm{RIS}(x, x', e_x, e_{x'}) = \max_{x' \in \mathcal{N}_x} \frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\max\!\left( \frac{\| x - x' \|_p}{\| x \|_p},\; \epsilon_{\min} \right)} \qquad (2)$$

where the numerator measures the $\ell_p$ norm of the *percent change* of the explanation $e_{x'}$ on the perturbed instance $x'$ with respect to the explanation $e_x$ on the original point $x$, the denominator measures the normalized $\ell_p$ distance between the inputs $x$ and $x'$, and the $\epsilon_{\min}$ term prevents division by zero when the denominator norm is smaller than some small constant. Here, we use the percent change from the explanation on the original point to the explanation on the perturbed instance, in contrast to the absolute difference between explanations (as in Equation 1), to enable comparison across attribution-based explanation methods whose outputs have vastly different ranges or magnitudes. Intuitively, one can expect similar explanations for points that are similar: the percent change in explanations (numerator) should be *small* for points that are close, i.e., have a *small* normalized distance (denominator). Note that the metric in Equation 2 measures the instability of an explanation, and higher values indicate higher instability.
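A minimal sketch of how Equation 2 can be computed for a single test point, assuming explanations and inputs are plain Python lists and that attribution entries are non-zero (the function and argument names are ours, not from any released code):

```python
def p_norm(v, p=2):
    # l_p norm of a vector represented as a list of floats
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def relative_input_stability(e_x, e_xs, x, xs, p=2, eps_min=1e-6):
    """Relative Input Stability (Equation 2) for one test point.

    e_x  : explanation for the original point x
    e_xs : explanations for the perturbed points in xs
    Returns the worst-case (max) instability; higher means less stable.
    """
    worst = 0.0
    for e_xp, xp in zip(e_xs, xs):
        # percent change of the explanation (numerator)
        num = p_norm([(a - b) / a for a, b in zip(e_x, e_xp)], p)
        # normalized input distance (denominator), floored at eps_min
        den = max(p_norm([a - b for a, b in zip(x, xp)], p) / p_norm(x, p), eps_min)
        worst = max(worst, num / den)
    return worst
```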

b) Relative Representation Stability. The stability definitions in Equations 1-2 do not cover cases where the model uses different logic paths (e.g., activating different neurons in a deep neural network) to predict the same label for the original and perturbed instances. In addition, past works have presented empirical evidence that the intermediate representations of a model are related to the underlying behavior or reasoning of the model (agarwal2021towards). Thus, we leverage the internal features or representations learned by the underlying model and propose Relative Representation Stability (RRS) as:

$$\mathrm{RRS}(x, x', e_x, e_{x'}) = \max_{x' \in \mathcal{N}_x} \frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\max\!\left( \frac{\| \mathcal{L}_x - \mathcal{L}_{x'} \|_p}{\| \mathcal{L}_x \|_p},\; \epsilon_{\min} \right)} \qquad (3)$$

where $\mathcal{L}_x$ denotes the internal model representation of $x$, e.g., the output embedding of a hidden layer, and $\epsilon_{\min}$ is an infinitesimal constant. Due to insufficient knowledge about the data-generating mechanism, we follow the perturbation mechanisms described above to generate perturbed samples, but use additional checks to ensure that on these perturbations the model behaves similarly to its training behavior. For any given instance $x$, we generate local perturbed samples $x' \in \mathcal{N}_x$ by adding infinitesimal noise to $x$ such that $m(x) = m(x')$. For every perturbed sample, we calculate the difference between the respective explanations $e_x$ and $e_{x'}$, and use Equation 3 to calculate the relative stability of the explanation. Note that, as before, the metric in Equation 3 measures the instability of an explanation, and higher values indicate higher instability.
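Relative Representation Stability differs from the Equation 2 computation only in its denominator, which compares internal representations instead of inputs; a self-contained sketch (again with our own, illustrative function names):

```python
def p_norm(v, p=2):
    # l_p norm of a vector represented as a list of floats
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def relative_representation_stability(e_x, e_xs, L_x, L_xs, p=2, eps_min=1e-6):
    """Relative Representation Stability (Equation 3) for one test point.

    L_x  : internal representation of x (e.g., a hidden-layer embedding)
    L_xs : representations of the perturbed points
    """
    worst = 0.0
    for e_xp, L_xp in zip(e_xs, L_xs):
        # percent change of the explanation (numerator)
        num = p_norm([(a - b) / a for a, b in zip(e_x, e_xp)], p)
        # normalized change in the model's internal representation
        den = max(p_norm([a - b for a, b in zip(L_x, L_xp)], p) / p_norm(L_x, p), eps_min)
        worst = max(worst, num / den)
    return worst
```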

Finally, we show that Relative Input Stability can be bounded using the Lipschitzness of the underlying model. In particular, we prove that RIS is upper bounded by the Relative Representation Stability scaled by the Lipschitz constant $\lambda_1$ of the intermediate model layer (assuming a neural network classifier) and a constant depending on the norms of $x$ and $\mathcal{L}_x$. See Appendix A for the complete proof.

$$\mathrm{RIS}(x, x', e_x, e_{x'}) \le \lambda_1 \frac{\| x \|_p}{\| \mathcal{L}_x \|_p} \, \mathrm{RRS}(x, x', e_x, e_{x'}) \qquad (4)$$

c) Relative Output Stability. Note that Relative Representation Stability assumes that the underlying ML model is white-box, i.e., the explanation method has access to internal model knowledge. Hence, for black-box ML models we define Relative Output Stability (ROS) as:

$$\mathrm{ROS}(x, x', e_x, e_{x'}) = \max_{x' \in \mathcal{N}_x} \frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\max\!\left( \frac{\| h(x) - h(x') \|_p}{\| h(x) \|_p},\; \epsilon_{\min} \right)} \qquad (5)$$

where $h(x)$ and $h(x')$ are the output logits for $x$ and $x'$, respectively. Again, we prove that RRS is upper bounded by ROS scaled by the Lipschitz constant $\lambda_2$ of the output model layer and a norm-dependent constant. See Appendix A for the complete proof.

$$\mathrm{RRS}(x, x', e_x, e_{x'}) \le \lambda_2 \frac{\| \mathcal{L}_x \|_p}{\| h(x) \|_p} \, \mathrm{ROS}(x, x', e_x, e_{x'}) \qquad (6)$$

## 4 Experiments

To demonstrate the utility of relative stability, we systematically compare and evaluate the stability of seven explanation methods on three real-world datasets using the metrics defined in Section 3. Further, we show that, in contrast to relative input stability, relative representation and output stability better capture the behavior of the underlying black-box model.

### 4.1 Datasets and Experimental Setup

Datasets. We use real-world structured datasets to empirically analyze the stability behavior of explanation methods, considering three benchmark datasets from high-stakes domains: i) the German Credit dataset (Dua:2019), which has records of 1,000 clients of a German bank; the downstream task is to classify clients into good or bad credit risks; ii) the COMPAS dataset (larson2016we), which has records of 18,876 defendants released on bail in U.S. state courts during 1990-2009; the dataset comprises features representing defendants' past criminal records and demographics, and the goal is to classify defendants into bail or no bail; and iii) the Adult dataset (Dua:2019), which has records of 48,842 individuals, including demographic, education, employment, personal, and financial features; the downstream task is to predict whether an individual's income exceeds $50K per year.

Predictors. We train logistic regression (LR) and artificial neural network (ANN) models as our predictive models. Details are in Appendix B.

Explanation methods. We evaluate seven attribution-based explanation methods: VanillaGrad (simonyan2013saliency), Integrated Gradients (sundararajan2017axiomatic), SmoothGrad (smilkov2017smoothgrad), GradientInput (shrikumar2017learning), LIME (ribeiro2016should), and SHAP (lundberg2017unified). Following agarwal2022probing, we also include a random assignment of importance as a control setting. Details on implementation and hyperparameter selection for the explanation methods are in Appendix B.

Setup. For each dataset and predictor, we: (1) train the prediction model on the respective dataset; (2) randomly sample points from the test set; (3) generate perturbations for each sampled point; (4) generate explanations for each sampled point and its perturbations using the seven explanation methods; and (5) evaluate the stability of the explanations for these points using all stability metrics (Equations 2, 3, and 5).
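The five steps above can be sketched as a single evaluation loop; everything here (`model`, `explainer`, `perturb`, `metric`) is a hypothetical callable standing in for the components described in this section:

```python
def evaluate_stability(model, explainer, test_points, perturb, metric, n_perturb=50):
    """Per-point instability scores for one (model, explainer, metric) triple.

    model(x)                 -> predicted label for x
    explainer(m, x)          -> feature-attribution vector for m's prediction on x
    perturb(x)               -> a noisy copy of x
    metric(e_x, e_xs, x, xs) -> an instability score (e.g., RIS)
    """
    scores = []
    for x in test_points:
        # generate perturbations, keeping only those with an unchanged prediction
        xs = [perturb(x) for _ in range(n_perturb)]
        xs = [xp for xp in xs if model(xp) == model(x)]
        e_x = explainer(model, x)
        e_xs = [explainer(model, xp) for xp in xs]
        scores.append(metric(e_x, e_xs, x, xs))
    return scores
```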

### 4.2 Results

Empirically verifying our theoretical bound. We empirically evaluated our theoretical bounds by computing both sides of Equation 4 for all seven explanation methods. Results in Figure 2 illustrate the empirical values and theoretical bounds for Relative Input Stability, confirming that none of our theoretical bounds are violated. In addition, we observe that, on average across all explanation methods, our upper bounds are relatively tight, with the mean theoretical bound being 233% higher than the corresponding empirical values. Similar results for the other datasets are included in the Appendix.

Evaluating the stability of explanation methods. We compare the stability of explanation methods by computing instability using all three variants described in Section 3.3. Results in Figure 3 show that the median instability of all explanation methods under Relative Input Stability (Figure 3; blue) is lower than under Representation (Figure 3; green) and Output (Figure 3; orange) Stability, because the relative input stability scores (Equation 2) are strongly influenced by the input differences in the denominator: the median RIS scores across all explanation methods are always lower than the RRS and ROS scores. Finally, we observe that while no explanation method is completely stable, on average across all datasets and relative stability variants, SmoothGrad generates the most stable explanations and outperforms the other methods by 12.7%.

## 5 Conclusion

We introduce Relative Stability metrics that measure the change in the output explanation with respect to the behavior of the underlying predictive model. Using these metrics, we analyze the stability of seven state-of-the-art explanation methods on multiple real-world datasets and predictive models. Our theoretical and empirical analysis demonstrates that, under the representation and output stability metrics, SmoothGrad generates the most stable explanations. We believe our work is an important step towards developing a broader set of evaluation metrics that incorporate the behavior of the underlying prediction model.

## References

## Appendix A Theoretical Interpretation

Prior works have shown that commonly used artificial neural network (ANN) models comprise linear and activation layers that satisfy Lipschitz continuity (gouk2021regularisation). Let us consider a 2-layer ANN model $m(x) = \sigma(h_2(h_1(x)))$, where $h_1$ and $h_2$ denote the first and second hidden layers, respectively, and $\mathcal{L}_x = h_1(x)$ is the output of the first hidden layer. For a given input $x$ and its perturbed counterpart $x'$, we can write the Lipschitz condition for the first hidden layer as:

$$\| \mathcal{L}_x - \mathcal{L}_{x'} \|_p \le \lambda_1 \| x - x' \|_p \qquad (7)$$

where $\lambda_1$ is the Lipschitz constant of the hidden layer $h_1$. Taking the reciprocal of Equation 7, we get:

$$\frac{1}{\| x - x' \|_p} \le \frac{\lambda_1}{\| \mathcal{L}_x - \mathcal{L}_{x'} \|_p} \qquad (8)$$

Multiplying both sides by $\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p$, we get:

$$\frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\| x - x' \|_p} \le \lambda_1 \, \frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\| \mathcal{L}_x - \mathcal{L}_{x'} \|_p} \qquad (9)$$

With further simplification (multiplying both sides by $\| x \|_p$, and multiplying and dividing the right-hand side by $\| \mathcal{L}_x \|_p$), we get:

$$\frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\| x - x' \|_p / \| x \|_p} \le \lambda_1 \frac{\| x \|_p}{\| \mathcal{L}_x \|_p} \cdot \frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\| \mathcal{L}_x - \mathcal{L}_{x'} \|_p / \| \mathcal{L}_x \|_p} \qquad (10)$$

For a given $x' \in \mathcal{N}_x$ and representations $\mathcal{L}$ from model layer $h_1$, using Equations 2-3 (in the regime where both denominators exceed $\epsilon_{\min}$), we can rewrite the two sides of Equation 10 as:

$$\frac{\left\| \frac{e_x - e_{x'}}{e_x} \right\|_p}{\| x - x' \|_p / \| x \|_p} = \mathrm{RIS}(x, x', e_x, e_{x'}) \qquad (11)$$

$$\mathrm{RIS}(x, x', e_x, e_{x'}) \le \lambda_1 \frac{\| x \|_p}{\| \mathcal{L}_x \|_p} \, \mathrm{RRS}(x, x', e_x, e_{x'}) \qquad (12)$$

where we find that the Relative Input Stability score is upper bounded by $\lambda_1$ times $\frac{\| x \|_p}{\| \mathcal{L}_x \|_p}$ times the Relative Representation Stability score. Finally, we can extend the above analysis by substituting $\mathcal{L}$ with the output logit layer and show that the same relation holds between Relative Representation Stability and Relative Output Stability:

$$\mathrm{RRS}(x, x', e_x, e_{x'}) \le \lambda_2 \frac{\| \mathcal{L}_x \|_p}{\| h(x) \|_p} \, \mathrm{ROS}(x, x', e_x, e_{x'}) \qquad (13)$$

where $\lambda_2$ is the Lipschitz constant of the output logit layer $h_2$.
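As a quick numerical sanity check of this bound (a toy sketch with arbitrary numbers; we take the first layer to be linear and upper-bound its Lipschitz constant $\lambda_1$ by the Frobenius norm of its weight matrix):

```python
import math

def p_norm(v, p=2):
    # l_p norm of a vector represented as a list of floats
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def matvec(W, x):
    # matrix-vector product for a weight matrix given as a list of rows
    return [sum(w * t for w, t in zip(row, x)) for row in W]

# toy linear first layer L(x) = W x, an input pair, and an explanation pair
W = [[1.0, 0.5], [0.2, 2.0]]
x, xp = [1.0, 2.0], [1.1, 1.9]
e_x, e_xp = [0.4, 0.6], [0.5, 0.5]

num = p_norm([(a - b) / a for a, b in zip(e_x, e_xp)])  # shared numerator
L_x, L_xp = matvec(W, x), matvec(W, xp)

ris = num / (p_norm([a - b for a, b in zip(x, xp)]) / p_norm(x))
rrs = num / (p_norm([a - b for a, b in zip(L_x, L_xp)]) / p_norm(L_x))

# the Frobenius norm upper-bounds the operator norm, hence the Lipschitz constant
lam1 = math.sqrt(sum(w * w for row in W for w in row))
bound_holds = ris <= lam1 * (p_norm(x) / p_norm(L_x)) * rrs
```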

## Appendix B Implementation Details

Predictors. We train logistic regression (LR) and artificial neural network (ANN) models. The ANN models have one hidden layer followed by a ReLU activation function and an output Softmax layer.

Predictor Training. To train all predictive models, we used an 80-10-10 train-test-validation split. We used the RMSProp optimizer, the binary cross-entropy loss function, and a fixed batch size. We trained for a fixed number of epochs and selected the model at the epoch with the highest validation-set accuracy as the final prediction model to be explained in our experiments.

Explanation Method Implementations. We used existing public implementations of all explanation methods in our experiments. We used the following captum software package classes: i) captum.attr.Saliency for VanillaGrad; ii) captum.attr.IntegratedGradients for Integrated Gradients; iii) captum.attr.NoiseTunnel combined with captum.attr.Saliency for SmoothGrad; iv) captum.attr.InputXGradient for GradientInput; and v) captum.attr.KernelShap for SHAP. We use the authors' LIME python package for LIME.

Metric hyperparameters. For all metrics, we generate a neighborhood of perturbed points for each test point $x$. The neighborhood points were generated by perturbing the clean sample with zero-mean Gaussian noise. For datasets with discrete binary inputs, we used independent Bernoulli random variables for the perturbations: for each discrete dimension, we replaced the original values with draws from a Bernoulli distribution with a small parameter. Choosing a small parameter ensures that only a small fraction of features are perturbed, reducing the likelihood of sampling an out-of-distribution point. For internal model representations, we use the pre-Softmax linear-layer output embedding for the LR models, and the pre-ReLU output embedding of the first hidden layer for the ANN models.

| Explanation Method | Hyperparameter | Value |
|---|---|---|
| LIME | n_samples | |
| LIME | kernel_width | |
| SHAP | std | |
| SHAP | n_samples | |
| SmoothGrad | std | 0.05 |
| Integrated Gradients | baseline | train data means |
| Random Baseline | attributions drawn from | |
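A sketch of one plausible implementation of the binary-feature perturbation described above; the default `p_flip` below is an illustrative placeholder, not the (elided) Bernoulli parameter used in our experiments:

```python
import random

def perturb_binary(x, p_flip=0.05):
    # with small probability p_flip, redraw each binary feature uniformly
    # from {0, 1}; a small p_flip leaves most features untouched, reducing
    # the chance of creating an out-of-distribution sample
    out = list(x)
    for i in range(len(out)):
        if random.random() < p_flip:
            out[i] = random.randint(0, 1)
    return out
```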