Many decision making systems deployed in the real world are not static - a phenomenon known as model adaptation takes place over time. The need for transparency and interpretability of AI-based decision models is widely accepted and has thus been worked on extensively. Usually, explanation methods assume a static system that has to be explained. Explaining non-static systems is still an open research question, which poses the challenge of how to explain model adaptations. In this contribution, we propose and (empirically) evaluate a framework for explaining model adaptations by contrastive explanations. We also propose a method for automatically finding regions in data space that are affected by a given model adaptation and thus should be explained.
Machine learning (ML) and artificial intelligence (AI) based decision making systems are increasingly affecting our daily life - e.g. predictive policing and loan approval [21, 40]. Given the impact of many ML and AI based decision making systems, there is an increasing demand for transparency and interpretability - the importance of these aspects was also emphasized by legal regulations like the EU's GDPR. In the context of transparency and interpretability, fairness and other ethical aspects become relevant [25, 13].
As a consequence, the research community has worked extensively on these topics and came up with methods for explaining ML and AI based decision making systems and thus meeting the demands for transparency and interpretability [18, 19, 36, 32]. Popular explanation methods [26, 36] are feature relevance/importance methods and example based methods. Instances of example based methods are counterfactual explanations [39, 38], influential instances and prototypes & criticisms - these methods use a set or a single example for explaining the behavior of the system. Counterfactual and contrastive explanations in general [26, 39, 38] are popular instances of example based methods for locally explaining decision making systems - the reason why these types of explanation are so popular is that there exists strong evidence that explanations by humans (which they try to mimic) are often counterfactual in nature. While some of these methods are global methods - i.e. explaining the system globally - most of the example based methods are local methods that try to explain the behavior of the decision making system at a particular instance or in a “small” region in data space [26, 31, 29, 10]. Further, existing explanation methods can be divided into model agnostic and model specific methods. While model agnostic methods view the decision making system as a black-box and thus do not need access to any model internals, model specific methods rely on and usually exploit model internal structures and knowledge for computing the explanation. However, the distinction between model agnostic and model specific methods is not that strict because there exist model specific methods that aim for efficiently computing (initially model agnostic) explanations of specific models.
The majority of the proposed explanation methods in literature assume fixed models - i.e. explaining the decisions of a fixed decision making system.
However, in practice decision making systems are usually not fixed but (continuously) evolving over time - e.g. the decision making system is adapted or fine-tuned on new data. In this context it becomes relevant to explain the changes of the decision making system (e.g. explanations of a changing classifier can become invalid, i.e. expire, after some time and thus pose a major problem in algorithmic recourse) - in particular in the context of Human-Centered AI (HCAI) which, besides explainability, is another important building block of ethical AI (some people even argue that explainability and transparency are an essential part of HCAI [33, 34]). HCAI allows the human being (the people) to “rule” AI systems instead of being “discriminated” or “cheated” by AI. Given the complexity of many modern ML or AI systems (e.g. Deep Neural Networks), it is usually difficult for a human to understand the decision making system or the impact of some adaptation or changes applied to a given system. Yet, understanding the impact of changing a system in a given way is crucial for rejecting system changes that violate some (ethical) guidelines or (legal) constraints.
For example, consider a (non-trivial) loan approval system that automatically accepts or rejects loan applications - i.e. we assume that the decision making process of this system is highly complicated and difficult to inspect from the outside (e.g. it might be a Deep Neural Network). We might encounter a situation in which a loan application was rejected with the argument of a low income and a bad payback rate in the past - which perfectly meets the bank's internal guidelines for accepting or rejecting a loan. Next, we adapt the loan approval system on new data - we assume that we got new data for fine-tuning the system, but that the guidelines or policies for accepting or rejecting did not change. After this adaptation, it turns out that the same application that was rejected under the “old” system (before the adaptation) is now accepted by the new system - we assume that this changed behavior violates the risk guidelines of the bank because it exposes the bank to an unnecessarily high risk of losing money. This is an example in which we would like to reject the given model adaptation because it would lead to a system that violates some predefined rules. Since in practice we do not always have a detailed policy available (otherwise there would be no or very limited need for an ML or AI system that learns such a policy from data), we need a mechanism that makes the impact of model adaptations/changes transparent so that it can be “approved” by a human. Ideally, the explanations of the adaptations/changes are simple enough to be understood by lay persons instead of being accessible only to ML or AI experts.
Although there exists general overview work that is aware of the challenge of explaining changing ML systems, how exactly to do this is still an open research question which we aim to address in this contribution. In this work, we propose a framework that uses contrastive explanations for explaining model adaptation - i.e. we argue that inspecting the changes/differences of contrastive explanations is a reasonable proxy for explaining model adaptations. More precisely, our contributions are:
We propose to compare contrasting explanations for locally explaining model adaptations.
We propose a method for finding relevant and interesting regions in data space which are affected by a given model adaptation and thus should be explained to the user.
We propose persistent local explanations for regularizing the model adaptation towards models with a specific behaviour.
The remainder of this work is structured as follows: After briefly reviewing the literature and taking a look at foundations like model adaptations (section 2.1) and contrasting explanations (section 2.2), we introduce and describe our proposal for using contrastive explanations for locally explaining model adaptations in section 3. In this context, we then study counterfactual explanations as a specific instance of contrastive explanations in section 4 - in particular we study robustness (see section 4.1) and propose a method for finding relevant and interesting regions in data space (see section 4.2). In section 5, we introduce our idea of persistent contrasting explanations - we consider different types of persistent explanation constraints and study how to add them to the model adaptation optimization problem. Finally, we empirically evaluate our proposed methods in section 6 and close our work with a summary and discussion in section 7.
Due to space constraints and for the purpose of better readability, we include all proofs and derivations in appendix A.
While (concept) drift as well as transparency (i.e. methods for explaining decision making systems) have been extensively studied separately, the combination of both has received much less attention so far.
Counterfactual explanations are a popular instance of example based explanation methods but all existing methods so far assume that the underlying model which is explained does not change over time - a strategy for counterfactual explanations of changing/drifting models is still missing .
A method called “counterfactual metrics” can be used for explaining drifting feature relevances of metric based models. In contrast to a counterfactual explanation, it focuses on feature relevance rather than on the change of counterfactual examples. The authors consider a scenario in which a metric based model is adapted to drifting feature relevances and the resulting model adaptation is explained by the changes made to the distance metric, which they call “counterfactual metrics”.
Contrastive explanations (in particular counterfactual explanations) have also been used for explaining concept drift. For this purpose a classifier is constructed which tries to separate two batches of data that are assumed to be affected by concept drift - the concept drift is explained by using contrastive explanations that contrast a sample from one class (i.e. batch) to the other class/batch under the trained classifier. The authors also propose a method for finding interesting and relevant samples, which they call “characteristic samples”, that are affected by the concept drift and are thus promising candidates for illustrating and explaining the present drift.
We assume a model adaptation scenario in which we are given a prediction function (also called model) $h: \mathcal{X} \to \mathcal{Y}$ and a set of (new) labeled data points $\mathcal{D} = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}$. Adapting the model $h$ to the data $\mathcal{D}$ means that we want to find a model $h'$ which is as similar as possible to the original model $h$ while performing well on labeling the (new) samples in $\mathcal{D}$. Model adaptation can be formalized as an optimization problem as stated in Definition 1.
Definition 1. Let $h: \mathcal{X} \to \mathcal{Y}$ be a prediction function (also called model) and $\mathcal{D} = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\}$ a set of (new) labeled data points. Adapting the model $h$ to the data $\mathcal{D}$ is formalized as the following optimization problem:
$$\underset{h' \in \mathcal{H}}{\arg\min}\; \theta(h, h') \quad \text{s.t.} \quad h'(x_i) \approx y_i \;\; \forall\, (x_i, y_i) \in \mathcal{D} \tag{1}$$
where $\theta(\cdot, \cdot)$ denotes a regularization that measures the similarity between two given models (in case of a parameterized model, one possible regularization measures the difference in the parameters) and $\approx$ refers to a suitable prediction error (e.g. zero-one loss or squared error) which is minimized by $h'$.
Note that for large $\mathcal{D}$, e.g. caused by abrupt concept drift, one could completely retrain $h'$ and abandon the requirement of closeness. In such situations it is still interesting to explain the difference of $h'$ and $h$.
Counterfactual explanations are a popular instance of contrastive explanations. A counterfactual explanation - often just called counterfactual, or pertinent negative by some authors - states a change to some features of a given input such that the resulting data point (called counterfactual) has a different (specified) prediction than the original input. The rationale is considered to be intuitive, human-friendly and useful because it tells the user which minimum changes can lead to a desired outcome [26, 39]. Formally, a (closest) counterfactual can be defined as follows:
Definition 2. Assume a prediction function $h: \mathbb{R}^d \to \mathcal{Y}$ is given. Computing a counterfactual $x_{\text{cf}} \in \mathbb{R}^d$ for a given input $x_{\text{orig}} \in \mathbb{R}^d$ is phrased as an optimization problem:
$$\underset{x_{\text{cf}} \in \mathbb{R}^d}{\arg\min}\; \ell\bigl(h(x_{\text{cf}}), y'\bigr) + C \cdot \theta(x_{\text{cf}}, x_{\text{orig}}) \tag{2}$$
where $\ell(\cdot)$ denotes the loss function, $y'$ the requested prediction, and $\theta(\cdot)$ a penalty term for deviations of $x_{\text{cf}}$ from the original input $x_{\text{orig}}$. $C > 0$ denotes the regularization strength.
In the following we assume a binary classification problem. In this case we denote a (closest) counterfactual according to Definition 2 of a given sample $x_{\text{orig}}$ under a prediction function $h$ as $\mathrm{CF}(x_{\text{orig}}, h)$, as the desired target $y'$ is uniquely determined.
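For a linear binary classifier the closest counterfactual can be written down directly, which makes the definition above concrete. The following is a minimal sketch, assuming a Euclidean penalty, labels in {-1, +1}, and a small hypothetical `margin` parameter that pushes the point just across the decision boundary; the function names are illustrative and not taken from the paper or from CEML.

```python
import math

def predict(w, b, x):
    """Linear binary classifier h(x) = sign(w^T x + b) with labels in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def closest_counterfactual(w, b, x, margin=1e-6):
    """Project x onto the decision hyperplane and step slightly past it.

    Assumes Euclidean closeness; `margin` is an illustrative parameter
    that makes the prediction actually flip.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Step length along w: removes the current score plus a tiny margin.
    step = (score + math.copysign(margin, score)) / norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]

# Usage: the first feature decides the label, so only it has to change.
w, b = [1.0, 0.0], 0.0
x_orig = [2.0, 3.0]
x_cf = closest_counterfactual(w, b, x_orig)
```

The change `x_cf - x_orig` is the quantity that is compared between the old and the adapted model in the remainder of the text.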
Some authors define a contrastive explanation as consisting of two parts: a pertinent negative (counterfactual) and a pertinent positive. A pertinent positive [14, 15, 7] describes a minimal set of features that are already sufficient for the given prediction. Pertinent positives are usually considered a part of, or an addition to, contrastive explanations and are assumed to provide additional insights into why the model took a particular decision.
A pertinent positive $x_{\text{pp}}$ of a sample $x_{\text{orig}}$ describes a minimal set of features (of this particular sample $x_{\text{orig}}$) that are sufficient for getting the same prediction - these features are also called “turned on” features and all other features are called “turned off”, meaning that they are set to zero or any other specified default value. One could either require that all “turned on” features are equal to their original values in $x_{\text{orig}}$ or one could relax this and only require that they are close to their original values in $x_{\text{orig}}$. The computation of a pertinent positive (also called sparsest pertinent positive) can be phrased as the following multi-objective optimization problem:
$$\begin{aligned} &\min_{x_{\text{pp}} \in \mathbb{R}^d} \|x_{\text{pp}}\|_0, \qquad \min_{x_{\text{pp}} \in \mathbb{R}^d} \bigl\|(x_{\text{orig}} - x_{\text{pp}})_{\mathcal{I}}\bigr\|_p \\ &\text{s.t.} \quad h(x_{\text{pp}}) = h(x_{\text{orig}}) \end{aligned} \tag{3}$$
where $(\cdot)_{\mathcal{I}}$ denotes the selection operator on the set $\mathcal{I}$ (it returns a vector that contains only those entries of the original vector whose indices are in $\mathcal{I}$) and $\mathcal{I}$ denotes the set of all “turned on” features. $\mathcal{I}$ is defined as follows:
$$\mathcal{I} = \bigl\{\, i \;\big|\; |(x_{\text{pp}})_i| > \epsilon \,\bigr\}$$
where $\epsilon \geq 0$ denotes a tolerance threshold at which we consider a feature “to be turned on” - e.g. a strict choice would be $\epsilon = 0$.
Since (3) is difficult to solve, a number of different methods for efficiently computing (approximate) pertinent positives have been proposed [14, 15, 7]. The approximation comes from giving up closeness - many methods successfully compute pertinent positives but cannot guarantee that they are globally optimal.
An obvious approach for explaining a model adaptation would be to explain and communicate the regions in data space where the predictions of the new and the old model differ - i.e. $\{x \in \mathcal{X} \mid h(x) \neq h'(x)\}$. However, in particular for incremental model adaptations, this set might be small and its characterization not meaningful. Hence, instead of finding the samples where the two models compute different predictions (which of course is an obvious and reasonable starting point for explaining the difference between two models), we aim for an explanation of their learned generalization rules. Because of the constraint in Eq. (1), it is likely that both models compute the same prediction on all samples in a given test set. However, the reasons for these predictions can be arbitrarily different (depending on the regularization $\theta$) - i.e. the internal prediction rules of both models are different. We think that explaining, and thus understanding, how and where the reasons and rules for predictions differ is much more useful than just inspecting points where the two models disagree - in particular when it comes to understanding and judging decision making systems in the context of human-centered AI.
In the following, we propose that a contrastive explanation can serve as a proxy for the model generalization at a given point; hence a comparison of the possibly different contrastive explanations of two models at a given point can be considered an explanation of how the underlying principles on which the models base their decisions differ. As is common practice, we thereby look at local differences, since a global comparison might be too complex and not easily understood by a human [26, 38, 11]. Furthermore, it might also be easier to add constraints on specific samples, instead of constraints on the global decision boundary, to the model adaptation problem Eq. (1) in Definition 1. The computation of such differences of explanations is tractable as long as the computation of the contrastive explanation under a single model is tractable - i.e. no additional computational complexity is introduced when using this approach for explaining model adaptations.
We define this type of explanation as follows:
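Definition 3. We assume that we are given a set of labeled data points $\mathcal{D}_* = \{(x_i, y_i)\}$ whose labels are correctly predicted by both models $h$ and $h'$.
For every $(x_i, y_i) \in \mathcal{D}_*$, let $\delta_h^i$ be a contrastive explanation under $h$ and $\delta_{h'}^i$ a contrastive explanation under the new model $h'$. The explanation of the model differences at point $(x_i, y_i)$ and its magnitude is then given by the comparison of both explanations:
$$\Psi\bigl(\delta_h^i, \delta_{h'}^i\bigr), \qquad \psi\bigl(\delta_h^i, \delta_{h'}^i\bigr) \tag{4}$$
where $\Psi(\cdot)$ denotes a suitable operator which can compare the information contained in two given explanations and $\psi(\cdot)$ denotes a real-valued distance measure judging the difference of explanations.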
Note that the explanation as defined in Definition 3 can be applied more generally to compare two classifiers $h$ and $h'$ w.r.t. given input locations, even if $h'$ does not constitute an adaptation of $h$. For simplicity, we assume uniqueness of contrastive explanations in the definition, either by design, as is the case for linear models, or by suitable algorithmic tie breaks.
The concrete form of the explanation heavily depends on the comparison functions $\Psi$ and $\psi$ - this allows us to take specific use-cases and target groups into account. In this work we assume $\delta \in \mathbb{R}^d$ and that $\Psi$ is given by the component-wise absolute value $|\delta_h - \delta_{h'}|$, and we consider two possible realizations of $\psi$:
An obvious measurement of the difference of two explanations is to compute the Euclidean distance between them:
$$\psi(\delta_h, \delta_{h'}) = \|\delta_h - \delta_{h'}\|_2 \tag{6}$$
where one could also use a different $p$-norm (e.g. $p = 1$).
We can also measure the difference between two explanations by the cosine of their respective angle:
$$\psi(\delta_h, \delta_{h'}) = \cos\angle(\delta_h, \delta_{h'}) = \frac{\delta_h^\top \delta_{h'}}{\|\delta_h\|_2 \, \|\delta_{h'}\|_2} \tag{7}$$
Note that considering the angle, instead of the actual distance, has the advantage that it is scale invariant - i.e. it is more interesting if different features have to be changed than if the same features have to be changed slightly more.
Since the image of the cosine is limited to $[-1, 1]$, we can directly compare the values of different samples - whereas the general norm is only capable of comparing nearby points with one another, as it contains information regarding the local topological structure.
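The comparison functions described above are simple to compute once the two explanations are available as vectors. The sketch below is an illustration only: it assumes explanations are plain lists of floats (the change vectors under the old and the adapted model), and all function names are hypothetical.

```python
import math

def componentwise_abs_difference(delta_old, delta_new):
    """Feature-wise view of how the two explanations differ."""
    return [abs(a - b) for a, b in zip(delta_old, delta_new)]

def euclidean_distance(delta_old, delta_new):
    """Magnitude of the difference between two explanations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(delta_old, delta_new)))

def cosine_similarity(delta_old, delta_new):
    """Cosine of the angle between two explanations (scale invariant)."""
    dot = sum(a * b for a, b in zip(delta_old, delta_new))
    norm_old = math.sqrt(sum(a * a for a in delta_old))
    norm_new = math.sqrt(sum(b * b for b in delta_new))
    if norm_old == 0.0 or norm_new == 0.0:
        raise ValueError("cosine is undefined for a zero explanation")
    return dot / (norm_old * norm_new)

# Same feature changed, only by a different amount: distance > 0, cosine = 1.
same_direction = ([1.0, 0.0], [2.0, 0.0])
# A different feature has to be changed: cosine = 0.
other_feature = ([1.0, 0.0], [0.0, 1.0])
```

The two usage pairs illustrate the scale invariance discussed above: the cosine ignores that the same feature has to change slightly more, but reacts when a different feature has to change.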
Counterfactual explanations (Definition 2) are a popular instance of contrastive explanations (section 2.2). In the following, we study counterfactual explanations for explaining model adaptations as proposed in Definition 3. We first relate the difference between two linear classifiers to their counterfactuals and, vice versa, the change of counterfactuals to model change. Finally, we propose a method for finding relevant regions and samples for comparing counterfactual explanations.
First, we highlight the possibility to relate the similarity of two linear models at a given point to their counterfactuals. We consider a linear binary classifier $h$:
$$h(x) = \operatorname{sign}(w^\top x + b) \tag{8}$$
and assume w.l.o.g. that the weight vector has unit length, $\|w\|_2 = 1$. We assume the same for an adaptation $h'$ with unit weight vector $w'$.
In Theorem 4.1 we state how to use counterfactuals for approximately computing the local cosine similarity between two models - we interpret this as evidence for the usefulness of counterfactual explanations for measuring the difference between two given models.
Let and be two linear models, and a data point. Let and be the closest counterfactual of a data point under the original model resp. the adapted model . Then
Since every (possibly nonlinear) model can locally be approximated linearly, this result also indicates the relevance of counterfactuals to characterize local differences of two models.
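To see why counterfactuals carry this information, the following sketch works through the linear case. It assumes Euclidean closeness, unit-length weight vectors $w$ and $w'$ with offsets $b$ and $b'$ (notation chosen here for illustration), a point $x$ that lies on neither decision boundary and receives the same label from both models, and it ignores the arbitrarily small step needed to actually cross the boundary. It is an illustrative calculation, not the proof from the appendix.

```latex
% Closest counterfactual of x under h(x) = sign(w^T x + b) with ||w||_2 = 1:
% the orthogonal projection of x onto the hyperplane w^T x + b = 0.
\mathrm{CF}(x, h) - x = -(w^\top x + b)\, w, \qquad
\mathrm{CF}(x, h') - x = -(w'^\top x + b')\, w'.

% Cosine of the angle between the two counterfactual directions:
\cos\angle\bigl(\mathrm{CF}(x,h) - x,\ \mathrm{CF}(x,h') - x\bigr)
  = \frac{(w^\top x + b)\,(w'^\top x + b')\; w^\top w'}
         {|w^\top x + b|\;|w'^\top x + b'|}
  = \operatorname{sign}(w^\top x + b)\,\operatorname{sign}(w'^\top x + b')\; w^\top w'.

% Both models assign the same label to x, so the two signs agree and
\cos\angle\bigl(\mathrm{CF}(x,h) - x,\ \mathrm{CF}(x,h') - x\bigr)
  = w^\top w' = \cos\angle(w, w').
```

Under these assumptions the angle between the two counterfactual directions equals the angle between the two weight vectors, independent of where $x$ lies.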
Conversely, it is possible to limit the difference of counterfactual explanations by the difference of classifiers as follows:
Let $h$ be a binary linear classifier (Eq. (8)) and $h'$ be its adaptation. Then, the difference between the two closest counterfactuals of an arbitrary sample $x$ can be bounded as:
The task of sample based model comparison requires the selection of feasible samples, as the number of samples is usually too large to be dealt with by hand. Thus, we need to formalize a notion of characteristic samples in the context of model change to perform this sub-task automatically. In this section, we aim to formalize this notion and give a number of possible choices and respective approximations for this problem.
The idea is to provide an interest function, i.e. a function that marks the regions of the data space that are of interest for our consideration - we could use such a function for automatically finding interesting samples by applying it to a set of points to get a ranking or optimizing over the function for coming up with interesting samples. This function should have certain properties:
For every pair of fixed models it maps every point in the data space to a non-negative number - i.e. .
It should be continuous with respect to the classifiers and in particular for all if and only if .
Points that are “more interesting” should take on higher values.
Regions where the classifiers coincide are not of interest.
The last two properties are basically a localized version of the second, in the sense that they force us to turn properties of the decision boundary (which are global) into local, i.e. point-wise, properties. It is possible to make properties 3 and 4 rigorous, but this would require a disproportionate amount of theory.
Then it is easy to see that the four properties are fulfilled, if we assume that and is chosen correctly:
The first property follows from the definition of dissimilarity functions and so does the second. The fourth property follows from the fact that if the classifiers intrinsically perform the same computations, then the counterfactuals are the same and hence the interest is zero; on the other hand, if the classifiers (intrinsically) perform different computations (though the overall output may be the same), then the counterfactuals are different and hence the interest is positive. In a comparable way, the third property is reflected by the idea that obtaining counterfactuals is faithful to the computations, in the sense that slightly different computations lead to slightly different counterfactuals and very different computations lead to very different counterfactuals.
Besides the Euclidean distance Eq. (6), the cosine similarity Eq. (7) is a potential choice for comparing two counterfactuals. Since the cosine always takes values between $-1$ and $1$, we scale it to a positive codomain:
for some small $\epsilon > 0$. In this case the samples on the decision boundary are marked as not interesting, which fits the finding that the counterfactuals for those samples basically coincide with the samples themselves and therefore do not provide any (additional) information.
While the definition of the interest function Eq. (11) perfectly captures our goal of identifying interesting samples, it can be computationally difficult to evaluate. In particular, the computation of (closest) counterfactual explanations can be computationally expensive and “challenging” for many models - this becomes a major issue when optimizing, that is, searching for local maxima of the interest function. It is hence important to find a surrogate for the counterfactual that allows for fast and easy computation.
In these cases, an efficient approximation is possible, provided the classifier is induced by a differentiable function $f$ in the form $h(x) = \operatorname{sign}(f(x))$. Then the gradient of $f$ enables an approximation of the counterfactual in the following form:
$$\mathrm{CF}(x, h) \approx x - \eta \, \operatorname{sign}\bigl(f(x)\bigr) \, \nabla_x f(x) \tag{14}$$
for a sufficient $\eta > 0$. In this case (Eq. (14)), the cosine similarity approach Eq. (12) works particularly well because it is invariant with respect to the choice of $\eta$ - i.e. $\eta$ can be ignored and we only need the gradient. Another benefit of this choice is that, under some smoothness assumptions regarding the classifier, it admits simple geometric interpretations of the obtained values, as the gradient always points towards the closest point on the decision boundary. This way it (locally) reduces the interpretation to linear classifiers, for which counterfactual explanations are well understood.
In the remainder of this work, we use the gradient approximation together with the cosine similarity for computing the “usefulness” of given samples for comparing their counterfactual explanations.
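The combination of gradient approximation and cosine similarity can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it assumes each classifier is given as a differentiable score function whose sign is the prediction, it estimates gradients by central finite differences, and it uses `1 - cosine` as one possible way of mapping the cosine to a non-negative score. The names `numerical_gradient`, `approx_counterfactual_direction` and `interest` are hypothetical.

```python
import math

def numerical_gradient(f, x, step=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += step
        x_minus[i] -= step
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * step))
    return grad

def approx_counterfactual_direction(f, x):
    """Direction of x_cf - x under the gradient surrogate (step size dropped)."""
    sign = 1.0 if f(x) >= 0 else -1.0
    return [-sign * g for g in numerical_gradient(f, x)]

def interest(f_old, f_new, x):
    """Assumed scaling: 1 - cosine between the two approximate directions."""
    d_old = approx_counterfactual_direction(f_old, x)
    d_new = approx_counterfactual_direction(f_new, x)
    dot = sum(a * b for a, b in zip(d_old, d_new))
    norm = math.sqrt(sum(a * a for a in d_old)) * math.sqrt(sum(b * b for b in d_new))
    if norm == 0.0:
        return 0.0  # a vanishing gradient gives no direction to compare
    return 1.0 - dot / norm

# Usage: the old model thresholds the first feature, the new one uses both.
f_old = lambda x: x[0]
f_new = lambda x: x[0] + x[1]
score = interest(f_old, f_new, [1.0, 1.0])
```

Applying `interest` to a set of candidate points and sorting by the result gives a ranking of samples whose counterfactual explanations are worth comparing.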
In the previous sections we proposed and studied the idea of comparing contrastive (in particular counterfactual) explanations for explaining model adaptations. In the experiments (see section 6) we observe that this method is indeed able to detect and explain less obvious and potentially problematic changes of the internal decision rules of adapted models.
In this context of explaining model adaptations, Human-Centered AI comes into play when the user rejects the computed model adaptation based on the explanations. For instance, it might happen that the local explanation under the old model was accepted, but the new local explanation under the new model violates some rules or laws - see the introduction for an example. In such a case, we want to constrain the model adaptation (Definition 1) such that (some) local explanations remain the same or valid under the new model - i.e. making some local explanations persistent - and to push the new model towards globally accepted behavior by making use of such local constraints.
For the purpose of “freezing” a local explanation in the form of a contrastive explanation - i.e. making it persistent -, we propose the following (informal) requirements:
Distance to the decision boundary must be within a given interval.
Counterfactual explanation must be still (in-)valid.
Pertinent positive must be still (in-)valid.
We always assume that we are given a labeled sample $(x_{\text{orig}}, y_{\text{orig}})$ that is correctly labeled by the old model $h$ as well as the new model $h'$ - i.e. $h(x_{\text{orig}}) = h'(x_{\text{orig}}) = y_{\text{orig}}$. In the subsequent section, we study how to write constraint Eq. (15c) in Eq. (15) for the different requirements/constraints as listed in section 5.1 - it turns out that we can often write the constraints (at least after a reasonable relaxation) as additional labeled samples, which enables a straightforward incorporation into many model adaptation procedures (see section 5.3 for details):
In case of a classifier, one might require that the distance of the sample to the decision boundary is not larger than some fixed threshold. Applying this to our model adaptation setting, we get the following constraint:
However, because reasoning over distances to decision boundaries might be too complicated and often difficult to formalize as a computationally tractable constraint, one might instead require that all samples that are “close” or “similar” to the given sample must have the same prediction:
where we defined the set of all “similar”/“close” points as follows:
where closeness is judged by an arbitrary similarity/closeness measure - e.g. in case of real valued features the $p$-norm might be a popular choice. If this set contains a “small” number of elements only, then Eq. (18) is computationally tractable and can be added as a set of constraints to the optimization problem Eq. (15). However, in case of real valued features where we use the $p$-norm (e.g. $p = 1$ or $p = 2$) as a closeness/similarity measure, we get the following constraint:
Note that constraints of the form of Eq. (20) are well known and studied in the adversarial robustness literature - these constraints reduce the problem to training a locally adversarially robust model.
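For a linear classifier, a norm-ball constraint of this kind can be checked in closed form using the dual norm: the worst-case score inside the ball is the score at the centre minus the radius times the dual norm of the weight vector. The sketch below assumes labels in {-1, +1}, supports only $p \in \{1, 2, \infty\}$, and uses a strictly positive worst-case margin as a sufficient condition; `is_locally_robust` and its parameters are hypothetical names.

```python
import math

def dual_norm(w, p):
    """Norm dual to the p-norm, for p in {1, 2, inf}."""
    if p == 1:
        return max(abs(wi) for wi in w)            # dual of l1 is l-infinity
    if p == 2:
        return math.sqrt(sum(wi * wi for wi in w))  # l2 is self-dual
    if p == float("inf"):
        return sum(abs(wi) for wi in w)            # dual of l-infinity is l1
    raise ValueError("only p in {1, 2, inf} is supported in this sketch")

def is_locally_robust(w, b, x, y, eps, p=2):
    """True if all x' with ||x' - x||_p <= eps keep a positive margin for label y."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return margin - eps * dual_norm(w, p) > 0

# Usage: the sample is at distance 2 from the boundary of the classifier.
w, b = [1.0, 0.0], 0.0
robust_small_ball = is_locally_robust(w, b, [2.0, 0.0], 1, eps=1.0)
robust_large_ball = is_locally_robust(w, b, [2.0, 0.0], 1, eps=3.0)
```

For nonlinear models no such closed form is available in general, which is why the text points to adversarially robust training.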
Further relaxing the idea of a persistent distance to the decision boundary might lead to requirements where a set of features is increased or decreased such that the original prediction remains the same. For instance one might have a set of which must not change the prediction if added to the original sample , yielding the following constraint:
Recall that in a counterfactual explanation, we add a perturbation to the data point which results in a (specified) prediction different from :
where we defined .
Requiring that the same counterfactual explanations holds for the adapted model , yields the following constraint:
Note that with constraint Eq. (23) alone, we cannot guarantee that the counterfactual will be the closest counterfactual under the adapted model - although it is guaranteed to be a valid counterfactual explanation. However, we think that computing the closest counterfactual is not that important, because the closest counterfactual is very often an adversarial example, which might not be that useful for explanations [26, 6], and for sufficiently complex models computing the closest counterfactual becomes computationally difficult. Furthermore, closeness becomes even less important when dealing with plausible counterfactuals, which are usually not the closest ones - if a counterfactual is plausible under the old model, one would expect that it is also plausible under the adapted model because the data manifold itself is not expected to change that much.
Recall that a pertinent positive describes a sparse sample where all non-zero feature values are as close as possible to the feature values of the original sample and the prediction is still the same:
Requiring that is still a pertinent positive of under the adapted model , yields the following constraint:
Similar to the case of persistent counterfactual explanations, Eq. (25) does not guarantee that is the sparsest or closest pertinent positive of under - it could happen that there exists an even sparser or closer pertinent positive of under which was invalid under the old model . However, it is guaranteed that is a sparse pertinent positive of under which we consider to be sufficient for practical purposes, in particular if taking into account the computational difficulties of computing a pertinent positive - as stated in , computing a pertinent positive (even of “classic” ML models) is not that easy.
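Both persistence requirements can be expressed as additional labeled samples, as noted earlier. The following sketch illustrates this under simple assumptions: models are plain Python callables that return a label, explanations are lists of floats, and the helper names `persistence_samples` and `violated_constraints` are hypothetical.

```python
def persistence_samples(x_orig, y_orig, delta_cf=None, y_cf=None, x_pp=None):
    """Turn a counterfactual and/or a pertinent positive into labeled samples."""
    samples = []
    if delta_cf is not None:
        # The counterfactual point must keep its (different) target label.
        x_cf = [xi + di for xi, di in zip(x_orig, delta_cf)]
        samples.append((x_cf, y_cf))
    if x_pp is not None:
        # The pertinent positive must keep the original label.
        samples.append((list(x_pp), y_orig))
    return samples

def violated_constraints(h_new, samples):
    """Samples whose required label is not reproduced by the adapted model."""
    return [(x, y) for x, y in samples if h_new(x) != y]

# Usage with two toy models: the old one thresholds the first feature,
# the adapted one also takes the second feature into account.
h_old = lambda x: 1 if x[0] >= 0 else 0
h_new = lambda x: 1 if x[0] + x[1] >= 0 else 0
samples = persistence_samples(
    x_orig=[1.0, 2.0], y_orig=1,
    delta_cf=[-2.0, 0.0], y_cf=0,   # counterfactual found under the old model
    x_pp=[1.0, 0.0],                # pertinent positive found under the old model
)
violations = violated_constraints(h_new, samples)
```

In this toy setting the pertinent positive stays valid under the adapted model, while the counterfactual does not, so the adaptation would have to be rejected or constrained.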
We consider a scenario where we have a sample-wise loss function $\ell$ (e.g. the squared error or the negative log-likelihood) that penalizes prediction errors and a set $\mathcal{D}$ of (new) labeled data points to which we wish to adapt our original model $h$ - we rewrite the model adaptation optimization problem Eq. (1) as follows:
$$\underset{h' \in \mathcal{H}}{\arg\min}\; \theta(h, h') + C \cdot \sum_{(x_i, y_i) \in \mathcal{D}} \ell\bigl(h'(x_i), y_i\bigr) \tag{26}$$
where the hyperparameter $C > 0$ allows us to balance between closeness and correct predictions.
Next, we assume that we have a set of persistence constraints as discussed in the previous section 5.2. Considering these constraints, we rewrite the constrained model adaptation optimization problem Eq. (15) as follows:
where we introduce a second hyperparameter that denotes a regularization strength which, similar to the hyperparameter that balances closeness and correct predictions, helps us enforce satisfaction of the additional persistence constraints - encoded as Eq. (15c) in the original informal modelling Eq. (15).
Assuming a parameterized model, we can use any black-box optimization method (like Downhill-Simplex) or a gradient-based method if Eq. (27) happens to be fully differentiable with respect to the model parameters. However, such methods usually come without any guarantees and are highly sensitive to the solver and the chosen hyperparameters. Therefore, one would be advised to exploit model specific structures for efficiently solving Eq. (27) - e.g. write Eq. (27) in constrained form and turn it into a convex program.
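As a concrete illustration of the gradient-based route, the sketch below adapts a logistic regression model by plain gradient descent. It is a minimal example under stated assumptions, not the authors' solver: the regularization is the squared Euclidean distance between old and new parameters, the loss is the logistic loss with labels in {0, 1}, the persistence constraints are given as additional labeled samples, and the weights `C` and `lam` as well as the learning rate and step count are hypothetical defaults chosen for this toy problem.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(theta, x):
    """Linear score w^T x + b with theta = [w_1, ..., w_d, b]."""
    return sum(w * xi for w, xi in zip(theta[:-1], x)) + theta[-1]

def predict(theta, x):
    return 1 if score(theta, x) >= 0 else 0

def adapt(theta_old, data, constraints, C=1.0, lam=1.0, lr=0.05, steps=2000):
    """Minimize ||theta - theta_old||^2 + C * loss(data) + lam * loss(constraints)."""
    theta = list(theta_old)
    weighted = [(x, y, C) for x, y in data] + [(x, y, lam) for x, y in constraints]
    for _ in range(steps):
        # Gradient of the closeness term.
        grad = [2.0 * (t - t0) for t, t0 in zip(theta, theta_old)]
        for x, y, weight in weighted:
            err = sigmoid(score(theta, x)) - y  # d(logistic loss) / d(score)
            for i, xi in enumerate(x):
                grad[i] += weight * err * xi
            grad[-1] += weight * err
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Usage: the old model only looks at the first feature; the new data is
# separated by the second feature; one persistent sample keeps old behaviour.
theta_old = [1.0, 0.0, 0.0]
new_data = [([0.0, 2.0], 1), ([0.0, -2.0], 0)]
persistent = [([2.0, 0.0], 1)]
theta_new = adapt(theta_old, new_data, persistent)
```

The fixed learning rate is only safe for small, well-scaled problems like this one; for anything larger, a convex solver or a line search is the more robust choice, in line with the advice above.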
We empirically evaluate each of our proposed methods separately. We demonstrate the usefulness of comparing contrastive explanations for explaining model adaptations in section 6.2, and in section 6.3 we evaluate our method for finding relevant regions in data space that are affected by the model adaptations and thus are interesting candidates for illustrating the corresponding difference in counterfactual explanations (see section 4.2). Finally, we demonstrate the effectiveness of persistent local explanation for pushing the model adaptation towards a desired behaviour.
The Python implementation of all experiments is available on GitHub (https://github.com/andreArtelt/ContrastiveExplanationsForModelAdaptations). We use the Python toolbox CEML for computing counterfactual explanations and use MOSEK as a solver for all mathematical programs. We gratefully acknowledge an academic license provided by MOSEK ApS.
We use the following data sets in our experiments:
This artificial toy data set consists of a binary classification problem and is generated by sampling from two different two-dimensional Gaussian distributions - each class has its own Gaussian distribution. The drift is introduced by changing the Gaussian distributions between the two batches. In the first batch the two classes can be separated with a threshold on the first feature, whereas in the second batch the second feature must also be considered.
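A data set of this kind could be generated as in the following sketch. The class means, the standard deviation, the number of samples and the helper name `sample_blobs` are hypothetical choices; the values used in the experiments are not stated in this section.

```python
import numpy as np


def sample_blobs(rng, mean_0, mean_1, n_per_class=100, std=0.5):
    """Draw a two-class, two-dimensional data set from two isotropic Gaussians."""
    X0 = rng.normal(loc=mean_0, scale=std, size=(n_per_class, 2))
    X1 = rng.normal(loc=mean_1, scale=std, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y


rng = np.random.default_rng(0)
# Batch 1 (hypothetical means): a threshold on the first feature separates the classes.
X_batch1, y_batch1 = sample_blobs(rng, mean_0=[-2.0, 0.0], mean_1=[2.0, 0.0])
# Batch 2 (hypothetical means): the classes overlap along the first feature,
# so the second feature is needed as well.
X_batch2, y_batch2 = sample_blobs(rng, mean_0=[-0.5, -2.0], mean_1=[0.5, 2.0])
```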
The “Boston Housing Data Set” is a data set for predicting house prices (regression) and contains samples each annotated with real-valued, positive features. We introduce drift by putting all samples whose NOX value is lower than or equal to a fixed threshold into the first batch and all other samples into the second batch.
The human activity recognition (HAR) data set contains data from volunteers performing activities like walking, walking downstairs and walking upstairs. Volunteers wear a smartphone that records three-dimensional linear and angular acceleration. We use a time window of fixed length to aggregate the data stream and compute the median per sensor axis and time window. We only consider the activities walking, walking upstairs and walking downstairs. We create drift by putting half of all samples with label walking, together with the samples labeled walking upstairs, into the first batch - i.e. the classifier has to distinguish walking vs. walking upstairs - and all other samples (the other half of walking together with the samples labeled walking downstairs) into the second batch - i.e. for the second batch the classifier has to distinguish normal walking vs. walking downstairs.
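The window-wise median aggregation can be sketched as follows. The sketch assumes non-overlapping windows and drops an incomplete trailing window; both are assumptions, as is the function name `windowed_median`, since the window length and overlap are not given here.

```python
import numpy as np


def windowed_median(signal, window_length):
    """Aggregate a sensor stream by the median per axis and time window.

    signal: array of shape (n_timesteps, n_axes).
    Returns an array of shape (n_windows, n_axes). Windows are assumed to be
    non-overlapping; an incomplete trailing window is discarded.
    """
    n_windows = signal.shape[0] // window_length
    trimmed = signal[: n_windows * window_length]
    windows = trimmed.reshape(n_windows, window_length, signal.shape[1])
    return np.median(windows, axis=1)


# Usage: 7 time steps, 2 axes, windows of length 3 -> 2 aggregated samples.
stream = np.arange(14, dtype=float).reshape(7, 2)
features = windowed_median(stream, window_length=3)
```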
The “German Credit Data set” is a data set for loan approval and contains samples each annotated with numerical and categorical attributes and a binary target value (“accept” or “reject”). We use only the first seven features: duration in month, credit amount, installment rate in percentage of disposable income, present residence since, age in years, number of existing credits at this bank and number of people being liable to provide maintenance for. We introduce drift by putting all samples where age in years is lower than or equal to a fixed threshold into the first batch and all other samples into the second batch.
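Both the house price and the credit data are split into two batches by thresholding a single feature (NOX and age in years, respectively). A minimal sketch of such a split is given below; the helper name `split_by_feature_threshold` and the threshold in the usage example are hypothetical, as the thresholds actually used are not stated here.

```python
import numpy as np


def split_by_feature_threshold(X, y, feature_idx, threshold):
    """Put samples with X[:, feature_idx] <= threshold into the first batch
    and all remaining samples into the second batch."""
    mask = X[:, feature_idx] <= threshold
    return (X[mask], y[mask]), (X[~mask], y[~mask])


# Usage with toy data; column 1 plays the role of the drift feature
# and the threshold 30.0 is a made-up value.
X_toy = np.array([[1.0, 22.0], [2.0, 45.0], [3.0, 30.0], [4.0, 61.0]])
y_toy = np.array([0, 1, 0, 1])
(batch1_X, batch1_y), (batch2_X, batch2_y) = split_by_feature_threshold(
    X_toy, y_toy, feature_idx=1, threshold=30.0)
```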
The data set consists of hyperspectral measurements of three types of coffee beans measured at three distinct times within three months of 2020. Samples of Arabica, Robusta and immature Arabica beans were measured by a SWIR_384 hyperspectral camera produced by Norsk Elektro Optikk. The sensor measures the reflectance of the samples for 288 wavelengths in the short-wave infrared range. For our experiments, we standardize and subsample the data. Prior analysis of the data set indicates that the data distribution drifts between the measurement times. Label-wise means of the data per measurement time are shown in Fig. 1.
We fit a Gaussian Naive Bayes classifier to the first batch and then adapt the model to the second batch of the Gaussian blobs data set. Besides both batches, we also generate samples (located between the two Gaussians) for explaining the model changes using counterfactual explanations. We compute counterfactual explanations for all test samples under the old and the adapted model. The differences of the counterfactuals are shown in the right plot of Fig. 2. We observe a significant change in the second feature of the adapted model - which makes sense since we know that, in contrast to the first batch, the second feature is necessary for discriminating the data in the second batch.
We fit a linear regression model to the first batch and then completely refit the model on the first and second batch of the house prices data set. We use the test data from both batches for explaining the model changes using counterfactual explanations. We compute counterfactual explanations under the old and the adapted model, where we always use a fixed target prediction and allow a fixed deviation from it. The differences of the counterfactuals are shown in the left plot of Fig. 2. We observe that basically only the feature NOX changes - which makes sense because we split the data into two batches based on this feature and we would also consider this feature to be relevant for predicting house prices.
We fit a Gaussian Naive Bayes classifier to the first batch and then adapt the model to the second batch of the human activity recognition data set. We use the test data from both batches for explaining the model changes using counterfactual explanations. The differences of the counterfactuals (separated by the target label) are shown in Fig. 3. In both cases we observe some noise but also a significant change in the Y axis of the acceleration sensor and the X axis of the gyroscope - both changes look plausible because switching between walking up-/downstairs should affect the Y axis of the acceleration sensor while walking straight might be measurable by the X axis of the gyroscope, but since this is a real world data set, we do not really know the ground truth.
We consider the model drift between a model trained with the data collected on the 26th June and another model based on the data from 14th August (from the 28th August in a second experiment). As we know that the drift in our data set is abrupt, we train a logistic regression classifier on the training data collected at the first measurement time (model_1), and another on the second measurement time (model_2). We compute counterfactual explanations for all the samples in the test set of the first measurement time that are classified correctly by model_1 but misclassified by model_2. The target label of the explanation is the original label. This way, we analyze how the model changes for the different measurement times. The mean difference between the counterfactual explanation and the original sample is visualized in Fig. 4. We observe that there are (interestingly) only a few frequencies that are consistently treated differently by the two models.
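For a linear regression model, the comparison of counterfactuals before and after adaptation can be illustrated in closed form. The sketch below is not the CEML-based procedure used in the experiments: it assumes the Euclidean distance, an exact target prediction (no allowed deviation) and no feature constraints, and all function names are hypothetical.

```python
import numpy as np


def closest_counterfactual_linear(x, w, b, y_target):
    """Closest point to x (Euclidean distance) whose linear prediction
    w @ x + b equals y_target: an orthogonal projection onto that level set."""
    residual = y_target - (w @ x + b)
    return x + residual / (w @ w) * w


def counterfactual_change(x, old_model, new_model, y_target):
    """Difference between the counterfactual perturbations under the adapted
    and the old model; each model is a (weights, bias) pair."""
    w_old, b_old = old_model
    w_new, b_new = new_model
    delta_old = closest_counterfactual_linear(x, w_old, b_old, y_target) - x
    delta_new = closest_counterfactual_linear(x, w_new, b_new, y_target) - x
    return delta_new - delta_old


# Usage: the old model relies on feature 0, the adapted model on feature 1.
x = np.array([1.0, 1.0])
old_model = (np.array([1.0, 0.0]), 0.0)
new_model = (np.array([0.0, 1.0]), 0.0)
change = counterfactual_change(x, old_model, new_model, y_target=3.0)
```

In this toy case the change is negative in the first feature and positive in the second, reflecting that the adapted model needs a different feature to be altered in order to reach the same target prediction.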
We follow the same procedure as in section 6.2 but this time we do not use all test samples but only the most relevant ones (a small percentage of the test samples) as determined by our method proposed in section 4.2. In Fig. 5 we plot the changes of the counterfactual explanations for both cases. We observe the same effects in both cases but with less noise when using only a few relevant samples - this suggests that our method from section 4.2 successfully identifies relevant samples for highlighting and explaining the specific model changes.
We follow the same procedure as in section 6.2 but this time we do not use all test samples but only the most relevant ones (a small fraction of the test samples) as determined by our method proposed in section 4.2. In Fig. 6 we plot the changes of the counterfactual explanations for both cases when switching from walking up-/downstairs to walking straight. We observe the same effects in both cases but with much less noise when using only a few relevant samples (we also clearly observe a small change in the Y axis of the gyroscope which is not that strong when using all samples) - this suggests that our method from section 4.2 successfully identifies relevant samples for highlighting and explaining the specific model changes. Considering only the most relevant samples yields the same (but much stronger) results while saving a lot of computation time - this becomes even more useful when every sample has to be inspected manually (e.g. in some kind of manual quality assurance).
We fit a decision tree classifier to the first batch and completely refit the model to the first and second batch of the credit data set. The test data from both batches is used for computing counterfactual explanations for explaining the model changes. The changes in the counterfactual explanations for switching from “reject” to “accept” are shown in the left plot of Fig. 7. We observe that after adapting the model to the second batch (recall that we split data based on age), there are a couple of cases where increasing the credit amount would turn a rejection into an acceptance, which we consider inappropriate and unwanted behaviour. We therefore use our proposed method for persistent local explanations from section 5.1 and section 5 to avoid this observed behaviour. The results of the constrained model adaptation are shown in the right plot of Fig. 7. We observe that now there is nearly no case in which increasing the credit amount turns a rejection into an acceptance - this suggests that our proposed method for persistent local explanations successfully pushed the model towards our requested behaviour.
In this work we proposed to compare contrastive explanations as a proxy for explaining and understanding model adaptations - i.e. highlighting differences in the underlying decision making rules of the models. In this context, we also proposed a method for finding samples where the explanation changed significantly and thus might be illustrative for understanding the model adaptation. Finally, we proposed persistent contrastive explanations for pushing the model adaptation towards a specific behaviour - i.e. ensuring that the model (after adaptation) satisfies some specified criteria. We empirically demonstrated the functionality of all our proposed methods.
In future research we would like to study the benefits of comparing contrastive explanations for explaining model adaptations from a psychological perspective - i.e. conducting a user study to learn more on how people perceive model adaptations and how useful they find these explanations for understanding and assessing model adaptations.
Given a sample and a corresponding closest counterfactual (Definition 2) under a classifier, we can compute the weight vector of a locally linear approximation of the classifier between the sample and the counterfactual as follows:
Given a sample and a closest counterfactual before the model drift and another one after the model drift, we can compute the corresponding locally linear approximations of the decision boundaries Eq. (28) and compute the cosine of the angle between the two weight vectors Eq. (28) as follows:
which concludes the proof. ∎
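The comparison of the two local approximations can be sketched numerically. The sketch assumes that the weight vector of the locally linear approximation points from the sample towards its closest counterfactual and is normalized to unit length; this form is an assumption made for illustration and may differ from Eq. (28), and the function names are hypothetical.

```python
import numpy as np


def local_weight_vector(x, x_cf):
    """Unit vector from the sample x towards its counterfactual x_cf
    (assumed form of the local linear approximation's weight vector)."""
    direction = x_cf - x
    return direction / np.linalg.norm(direction)


def cosine_between_explanations(x, x_cf_old, x_cf_new):
    """Cosine of the angle between the local weight vectors obtained
    before and after the model drift."""
    w_old = local_weight_vector(x, x_cf_old)
    w_new = local_weight_vector(x, x_cf_new)
    return float(w_old @ w_new)


# Usage: orthogonal counterfactual directions give a cosine of zero,
# identical directions give a cosine of one.
x0 = np.array([0.0, 0.0])
cos_orthogonal = cosine_between_explanations(
    x0, np.array([1.0, 0.0]), np.array([0.0, 2.0]))
cos_same = cosine_between_explanations(
    x0, np.array([1.0, 1.0]), np.array([3.0, 3.0]))
```

A cosine close to one indicates that the local decision boundary near the sample barely rotated under the adaptation, whereas small or negative values indicate a substantial local change.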
Working out , where , by making use of Eq. (30) and yields:
Applying the triangle inequality to Eq. (31) yields:
Applying the Cauchy-Schwarz inequality and making use of the assumption to Eq. (32) yields:
Substituting in Eq. (33) yields the stated bound:
which concludes the proof. We could also rewrite Eq. (34) as follows: