Unified Shapley Framework to Explain Prediction Drift

by   Aalok Shanbhag, et al.

Predictions are the currency of a machine learning model, and to understand the model's behavior over segments of a dataset, or over time, is an important problem in machine learning research and practice. There currently is no systematic framework to understand this drift in prediction distributions over time or between two semantically meaningful slices of data, in terms of the input features and points. We propose GroupShapley and GroupIG (Integrated Gradients), as axiomatically justified methods to tackle this problem. In doing so, we re-frame all current feature/data importance measures based on the Shapley value as essentially problems of distributional comparisons, and unify them under a common umbrella. We axiomatize certain desirable properties of distributional difference, and study the implications of choosing them empirically.




1 Introduction

Fundamental machine learning theory is structured around the assumption that the training and test data belong to the same distribution. While this is a reasonable assumption for learning theory, in practice it is difficult to ascertain. In deployed models, the incoming stream of data might start to differ significantly from the static dataset that the model was trained on – a phenomenon named concept drift (gama2014survey; wang2015concept). Formally, concept or data drift is defined as the scenario where the distribution of the data, the label or the concept changes as compared to the training data that the model has seen. Previous drift detection methods have focused either on the overall error rate (gama2004learning) or on some other combination of the confusion matrix (wang2013concept), both of which require labels for the predictions, which are not guaranteed to be available for a machine learning model in production. Other work suggests using prediction drift as a proxy for concept drift in such cases (vzliobaite2010change). Prediction drift informs of the change in the model's prediction distribution, and this information may be of importance to the practitioner even if the model's accuracy is not impacted. For example, a lending company may have a quarterly target of loans to be disbursed, which may be achieved in a month if the prediction distribution of real world applicants differs from training data, thereby causing issues in business planning. A systematic method is thus needed for studying prediction drift and attributing it to a) the features of the model and b) the individual data points that constitute the distributional samples that are compared. We frame the question as follows: Has the empirical distribution of inputs to the model drifted in a way that affects model behavior? If so, which features and which points in the sample have caused this shift?

For this attribution to features and data, we focus on Shapley value based methods (vstrumbelj2014explaining; lundberg2017unified; sundararajan2020many; datta2016algorithmic). We include Integrated Gradients (sundararajan2017axiomatic) in this broad family as it is equivalent to the Aumann-Shapley cost sharing method. Here, we adapt the Shapley framework, in the context of machine learning, for the following task: given two data samples of the same shape, and a function which computes some metric of distributional difference on the predictions made by a model on the given datasets, attribute the output of that function to each point of the target dataset, and to each feature. By using the Shapley framework, we automatically inherit the Shapley axioms, which have certain desirable properties that we discuss in Section 4.1.

Currently, there is no consensus on which of the many distributional difference metrics should be used for calculating prediction drift, with previous work using measures like Jensen Shannon divergence (pinto2019automatic), the Kolmogorov–Smirnov test (dos2016fast) or the Wasserstein-1 distance (miroshnikov2020wasserstein). A comparative analysis of these methods is presented in Section 7. We demonstrate an axiomatic framework to choose the most appropriate distributional distance metric depending on the use case.


Our key contributions in this paper are:

  • Establishing an axiomatic framework for calculating and explaining prediction drift using Shapley values and IG

  • Extending the framework to explain arbitrary “groups” i.e. data and features together, thereby unifying several existing explanation methods

  • Applying the Shapley values formulation to a function of distributional difference

  • Axiomatization of measures of distributional difference

  • Empirical analysis of the implications of choosing a particular metric of distributional difference to measure prediction drift, over a few handcrafted examples

2 Related Work

Concept Drift

The problem of concept drift in machine learning has been extensively studied in the literature – spanning both sudden/instantaneous drift (sudden) and slow/gradual drift (stanley2003learning). Furthermore, the literature makes a distinction between “true” concept drift, and the “virtual” concept drift that happens due to a change in data distribution, essentially a sampling issue (salganicoff1997tolerating).

In the literature, popular methods for detecting concept drift are ADWIN (bifet2007learning) and the Page-Hinkley Test (gama2013evaluating). Both of these methods assume that labels are available for analysis, which is infeasible in a scenario where the model is deployed and constantly making predictions on new data.

Prediction Drift as a proxy

In the absence of instantly available labels, other methods devolve to a measurement of the drift of the distributions of predictions as an ad-hoc method to detect concept drift. Methods using this approach include work by pinto2019automatic, DBLP:journals/corr/abs-1902-02808, dos2016fast and vzliobaite2010change. All these methods utilize different metrics to measure the difference in the distributions of the predictions of the new data points against a standard distribution.

Model explanations

Methods to describe the contribution of input features towards the final value of the prediction have recently gained considerable interest, both from researchers and practitioners. One class of these methods utilizes the Shapley value (shapley1953value), a popular concept in game theory, to measure the contribution of each feature. Another method, Integrated Gradients (sundararajan2017axiomatic), utilizes the path between the input and a baseline for each feature to measure the attribution of each feature on that path, as a specialized case of the Shapley-like cost sharing method (commonly referred to as Shap). These methods have grown in popularity for quantifying the impact of features on a prediction at an instance level (vstrumbelj2014explaining; datta2016algorithmic; lundberg2017unified), on the loss at a global level (covert2020understanding), and also for quantifying the contribution of individual data points to a model’s performance (ghorbani2020distributional). They have also been proposed to understand feature importance for measures of fairness (begley2020explainability; miroshnikov2020wasserstein).

3 Terminology

In this section, we lay out the terminology and notations that we use throughout the paper.

Model function

– the machine learning model function f, which takes a vector of features of shape n and returns one or more outputs. We limit the analysis to feature vectors instead of more general feature tensors, to avoid complications in notation. This does not mean, however, that the theory is particular to models that accept one-dimensional vectors; it can be extended quite easily.

Similarly, for the sake of simplicity and without loss of generality, for models which output a vector of values, we analyze only one output at a time. For example, classification models output a vector of length equal to the number of classes, of which there is a particular class of interest which we wish to analyze. Akin to machine learning models with batch predict, f is also able to accept a batch input of shape m × n and return m outputs.

Sample (of points)

– in the context of a model, a sample of m feature vectors of shape n, the complete sample hence being of shape m × n. The sample could be a single point (m = 1) or multiple (m > 1). It could be chosen randomly from a distribution, or could be chosen intentionally as per requirements, e.g. points corresponding to men over the age of 50 from New York or the feature vector corresponding to ID “x” in a database of customers of an online retail store.

Explicand

– the input sample of shape m × n for which we want to explain the predictions, with respect to a particular model function.

Baseline

– a sample with the same dimensions as the explicand, against which the explicand is explained. All Shapley value based methods have a baseline, though it may not be obvious due to being implicit in the formulation (sundararajan2020many; lundberg2017unified). The explanation is dependent on the choice of baseline, and various papers (merrick2020explanation) have proposed certain choices of baselines, or ways to select one.

Value Function

– the set function v: 2^n → R that is used in the Shapley value formulation to obtain the attribution of each player. Here n is the number of features, and 2^n refers to all possible combinations of feature presence (or absence).

Drift

– a measure of distributional difference, commonly used in the context of time dependence, but we use it in a general sense.

Distributional drift function

– a function that, given two samples (as defined above), returns a value characterizing the difference between them. We restrict ourselves to analyzing distributional differences over 1-D samples.

Groups

– combinations of the feature–data-point components belonging to the explicand. These groups play the role of “players” in co-operative game theory for the purpose of Shapley and IG attributions of the drift value. Groups can be defined semantically, for example males and females, and can be formed as combinations both in the feature and data-point dimensions. The Shapley value is calculated on the marginals of the resulting groups as players that enter the coalition, over all such possible permutations.

4 Axioms

4.1 Axioms for attributions

From (sundararajan2020many; friedman1999three), we have the following desirable properties for attribution methods. In Section 6 we will formulate GroupShapley and GroupIG such that they are inherited. We state them here in terms of the group formulation for convenience. Reasons for their desirability are expanded on in the Appendix.

  1. Dummy - this axiom states that a group that doesn’t contribute to the game payout should get zero attribution.

  2. Efficiency - the sum of the attributions over all groups is equal to the difference of the model function’s output at the explicand and the baseline.

  3. Linearity - the attributions of a linear combination of two model functions are the same linear combination of the attributions of the model functions, taken one at a time.

  4. Symmetry - for model functions that are symmetric in two groups, where the two groups have the same value as each other in both the explicand and the baseline, the attributions to both groups should be the same.

  5. Affine Scale Invariance - requires the attributions to be invariant under the same affine transformation of both the model functions, and the groups.

  6. Demand Monotonicity - for a model function that is monotonic for a group, the attribution of the group should only increase if the value of the group increases.

  7. Proportionality - if the model function can be expressed as an additive sum of the input groups, and the baseline is zero, the attributions to each group are proportional to the group value.

4.2 Axioms for distributional drift functions

miroshnikov2020wasserstein propose some desirable properties for a distributional drift function:

  1. It should be continuous with respect to the change in the geometry of the distributions.

  2. It should be non-invariant with respect to monotone transformations of the distributions.

Since our focus is on the distributional samples, and not the distributions themselves, we restate these properties for distributional drift measures for two 1-D samples.

  1. Sensitivity - the drift function should be continuous with respect to changes in the individual points in the samples. For example, given two 1-D samples, if we change the value of any point in either, the function output should change.

  2. Differentiability - the drift function should be differentiable with respect to the individual points in the samples. This is a stronger version of the continuity axiom.

  3. Symmetry - the drift function should be symmetric in its two samples, i.e. swapping the samples leaves the drift unchanged.

  4. Identity of Indiscernibles - the drift is zero if and only if both samples are the same.

  5. Directionality - the drift is signed based on the sample order, i.e. swapping the samples flips its sign. A metric cannot satisfy both Symmetry and Directionality unless it is always zero.

5 Prediction Drift

We define prediction drift as the change in the distribution of the predictions of a model between two semantically meaningful slices of data.

The need for studying prediction drift to answer the question raised above arises due to the following reasons:

  1. Detecting drift in the distribution of individual features may not be sufficient. For instance, the predictions may drift despite no drift in any of the individual feature distributions. This is because the joint distribution of the features may have drifted.

    Reference Distribution              Target Distribution
    x    y    z    f(x, y, z)           x    y    z    f(x, y, z)
    1.0  1.0  1.0   3.0                 3.0  1.0  2.0  9.0
    2.0  2.0  2.0   8.0                 1.0  2.0  3.0  8.0
    3.0  3.0  3.0  15.0                 2.0  3.0  1.0  6.0

    Table 1: Predictions of the model function f(x, y, z) on a reference and a target sample. The x, y and z distributions are unchanged at the univariate level, but the multivariate distribution has changed, and so has the prediction distribution.

  2. Furthermore, drift in individual features may not always lead to drift in predictions. This could, for instance, happen if the drifting feature is unimportant to the model.

  3. Finally, detecting drift in the prediction distributions may not be sufficient either. For instance, while the prediction distributions may remain the same, it could still be that the input feature distributions have changed in a meaningful way that affects how the model reasons. Such a drift is still worth noting. For instance, the camera that feeds a face detection model could rotate over time, due to hinge failure. A robust model will be able to handle the distortion of the image for a while before it fails. The prediction distribution will not change initially, but the feature attributions over the pixel regions will change, which can serve as an early warning system.
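The situation in Table 1 can be checked numerically. The sketch below uses the six tabulated rows; the formula `x*z + y + z` is an assumption of this sketch (it reproduces every tabulated output, but it is not confirmed as the authors' model function).

```python
import numpy as np

# Rows of Table 1 as (x, y, z) triples.
reference = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
target = np.array([[3.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 1.0]])


def model(batch):
    """Assumed model: f(x, y, z) = x*z + y + z (matches the tabulated outputs)."""
    x, y, z = batch[:, 0], batch[:, 1], batch[:, 2]
    return x * z + y + z


# Each column holds the same multiset of values in both samples,
# so a per-feature (univariate) drift check sees nothing.
marginals_unchanged = all(
    np.array_equal(np.sort(reference[:, j]), np.sort(target[:, j]))
    for j in range(reference.shape[1])
)

ref_pred = model(reference)
tgt_pred = model(target)
# Signed shift of the mean prediction, target minus reference.
mean_shift = tgt_pred.mean() - ref_pred.mean()

print("marginals unchanged:", marginals_unchanged)
print("reference predictions:", ref_pred)
print("target predictions:", tgt_pred)
print("mean shift:", mean_shift)
```

Only the pairing of values across columns differs between the two samples, which is exactly what a joint-distribution change means here.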

We focus our attention on problems 1 and 2, leaving 3 for future work. To answer the aforementioned question, we rely on the following steps:

  • Measure prediction drift for the model given two slices of data

  • Attribute the drift to meaningful groups in the data.

Possible meaningful groups could be features of the model, n-tile buckets of predictions, or rule-based slices such as males vs females. We need to be careful to ensure that the number of observations in each slice is proportionally similar for each sample, to avoid statistical anomalies such as those seen in Simpson’s paradox (simpson).

Practically, for calculating the prediction drift given two data samples of unequal and/or large size, we suggest a bootstrapping approach. We sample from the two empirical distributions for a given number of repetitions and calculate the expected value of the prediction drift and the attributions and obtain statistical confidence bounds.
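A minimal sketch of the bootstrapping idea follows. The function names, the resample size, the number of repetitions and the synthetic predictions are all assumptions made for illustration; the same loop could wrap an attribution routine instead of a scalar drift function.

```python
import numpy as np


def wasserstein_1(a, b):
    """Wasserstein-1 distance between equal-sized 1-D samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def bootstrap_drift(pred_a, pred_b, drift_fn, n_boot=500, sample_size=200,
                    alpha=0.05, seed=0):
    """Bootstrap the drift between two 1-D prediction samples of any sizes.

    Returns the mean drift over the resamples and a percentile interval.
    """
    rng = np.random.default_rng(seed)
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    draws = np.empty(n_boot)
    for k in range(n_boot):
        # Resample both empirical distributions to a common size, with replacement.
        a = rng.choice(pred_a, size=sample_size, replace=True)
        b = rng.choice(pred_b, size=sample_size, replace=True)
        draws[k] = drift_fn(a, b)
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(draws.mean()), (float(lower), float(upper))


# Synthetic predictions for illustration only: unequal sizes, shifted by 5.
rng = np.random.default_rng(1)
day_one = rng.normal(100.0, 10.0, size=1500)
day_two = rng.normal(105.0, 10.0, size=900)

estimate, (low, high) = bootstrap_drift(day_one, day_two, wasserstein_1)
print(f"drift ~ {estimate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Resampling to a common size also sidesteps the equal-length requirement of the sorted-sample form of the Wasserstein-1 distance.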

6 Group Shapley and Group IG Formulation

6.1 The Shapley value

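The model function is f.

The Shapley value of a player i, playing an n-player coalitional game with a payout function v, is defined as

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \bigl( v(S \cup \{i\}) - v(S) \bigr)$$

where N is the set of all n players.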

6.2 Baseline Shapley

Baseline Shapley (sundararajan2020many), or BShap, takes a function f, an explicand x and a baseline x'.

The value or payout function is

$$v(S) = f(x_S;\, x'_{N \setminus S})$$

Here, the absence of a feature is modeled using the corresponding baseline value. BShap is equivalent to the Shapley-Shubik cost sharing method and satisfies the following axioms: Dummy, Linearity, Affine Scale Invariance, Demand Monotonicity, and Symmetry.
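As a small worked illustration (the function and the points are chosen here for convenience and are not taken from the paper), take f(x_1, x_2) = x_1 x_2 with explicand (1, 2) and baseline (0, 0). The payout of a coalition S is f evaluated with the features in S taken from the explicand and the rest from the baseline, and with two players every Shapley weight is 1/2:

```latex
\begin{aligned}
v(\varnothing) &= f(0, 0) = 0, & v(\{1\}) &= f(1, 0) = 0, \\
v(\{2\}) &= f(0, 2) = 0, & v(\{1, 2\}) &= f(1, 2) = 2, \\
\phi_1 &= \tfrac{1}{2}\bigl[v(\{1\}) - v(\varnothing)\bigr]
        + \tfrac{1}{2}\bigl[v(\{1, 2\}) - v(\{2\})\bigr] = 0 + 1 = 1, \\
\phi_2 &= \tfrac{1}{2}\bigl[v(\{2\}) - v(\varnothing)\bigr]
        + \tfrac{1}{2}\bigl[v(\{1, 2\}) - v(\{1\})\bigr] = 0 + 1 = 1.
\end{aligned}
```

The two attributions sum to 2, which equals f(1, 2) - f(0, 0), and the two features receive equal credit because f treats them symmetrically relative to this baseline.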

6.3 Integrated Gradients

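The Integrated Gradients formulation is

$$IG_i(x, x') = (x_i - x'_i) \int_{0}^{1} \frac{\partial f\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i} \, d\alpha$$

where x is the explicand and x' is the baseline. Integrated Gradients is equivalent to the Aumann-Shapley cost sharing method for continuous functions.

Integrated Gradients satisfies the following axioms: Dummy, Linearity, Affine Scale Invariance, Proportionality, and Symmetry.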
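A rough numerical sketch of Integrated Gradients is shown below. It approximates the path integral with a midpoint Riemann sum and the gradient with central finite differences, which is a simplification for illustration (a practical implementation would normally use automatic differentiation). The test function is an arbitrary choice made for this sketch.

```python
import numpy as np


def integrated_gradients(f, x, baseline, steps=50, eps=1e-4):
    """Approximate IG along the straight line from baseline to x.

    Uses a midpoint Riemann sum over the path and central finite
    differences for the partial derivatives.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    avg_grad = np.zeros_like(x)
    for alpha in alphas:
        point = baseline + alpha * (x - baseline)
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            avg_grad[i] += (f(point + step) - f(point - step)) / (2 * eps)
    avg_grad /= steps
    return (x - baseline) * avg_grad


def f(v):
    """Illustrative function chosen for this sketch: x*y - z**2."""
    return v[0] * v[1] - v[2] ** 2


attributions = integrated_gradients(f, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
total_change = f(np.array([1.0, 2.0, 3.0])) - f(np.zeros(3))
print(attributions, attributions.sum(), total_change)
```

The attributions add up to the change in f between the baseline and the input, which is the Efficiency property in action.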

6.4 Drift Group Shapley

We define Drift Group Shapley, or GroupShapley, as being parametrized by the following choices:

  1. Choice of the explicand of shape m × n

  2. Choice of the baseline of the same shape as the explicand

  3. A model function f

  4. Additional functions, the chain of which we call G, which return two real valued outputs of equal shape for the explicand and the baseline sample

  5. Choice of a distributional difference function that takes the two equal shaped outputs of G and returns a real valued output

The group formulation is:


In GroupShapley, we explain the drift between the output of the explicand and the baseline. The number of players is equal to the number of groups times the number of features, where the number of groups is the number of sub-divisions across rows. If the whole sample is one group, the features are the only players. If each row is its own group, we end up with (number of rows) × (number of features) groups to which we attribute the payout. To be precise, we are attributing the drift score to each group in the explicand, where a group is a cross section consisting of at least one row and at most all rows, and at least one feature and at most n features.

To simulate the missingness of a player, we replace the group of interest with its aligned counterpart from the reference dataset, similar to the notion of the baseline in BShap or IG.
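The description above can be sketched as an exact (exponential-time) Shapley computation over groups. This is one possible reading rather than the authors' implementation: the payout of a coalition is taken to be the drift between the predictions on a hybrid sample (coalition cells from the explicand, all other cells from the aligned baseline) and the predictions on the baseline. The function names, the boolean-mask representation of groups and the linear model are assumptions of the sketch, and G is taken to be the identity.

```python
import itertools
import math

import numpy as np


def group_shapley(model, explicand, baseline, groups, drift_fn):
    """Exact Shapley attribution of the drift to groups of cells.

    groups: list of boolean masks of the explicand's (m, n) shape that
    partition its cells. Rows are assumed to be aligned already.
    """
    base_pred = model(baseline)

    def payout(coalition):
        # Groups outside the coalition are "missing": baseline values are kept.
        hybrid = baseline.copy()
        for g in coalition:
            hybrid[groups[g]] = explicand[groups[g]]
        return drift_fn(model(hybrid), base_pred)

    k = len(groups)
    phi = np.zeros(k)
    for g in range(k):
        others = [h for h in range(k) if h != g]
        for size in range(k):
            weight = (math.factorial(size) * math.factorial(k - size - 1)
                      / math.factorial(k))
            for subset in itertools.combinations(others, size):
                phi[g] += weight * (payout(subset + (g,)) - payout(subset))
    return phi


def mean_difference(a, b):
    """Expected value difference between two prediction samples."""
    return float(np.mean(a) - np.mean(b))


weights = np.array([2.0, 0.0, -1.0])  # hypothetical linear model


def model(batch):
    return batch @ weights


baseline = np.array([[1.0, 5.0, 0.0], [2.0, 6.0, 1.0]])
explicand = np.array([[3.0, 9.0, 0.0], [4.0, 1.0, 3.0]])

# One group per feature column: the whole sample is a single row group.
feature_groups = []
for j in range(explicand.shape[1]):
    mask = np.zeros(explicand.shape, dtype=bool)
    mask[:, j] = True
    feature_groups.append(mask)

phi = group_shapley(model, explicand, baseline, feature_groups, mean_difference)
total_drift = mean_difference(model(explicand), model(baseline))
print(phi, total_drift)
```

In this toy setup the second feature has zero weight in the model and receives zero attribution (Dummy), and the attributions sum to the total drift (Efficiency). Finer groups, such as one mask per cell, only change the `groups` list.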

We now propose to frame every existing Shapley formulation as a prediction drift between some aspect of the model’s behavior at the explicand and the baseline. We re-frame the two questions as:

  1. “Has the empirical distribution of inputs to the model drifted in a way that affects model behavior?” becomes “Is there a difference in groups between the explicand and the baseline that affects some aspect of model behavior?”

  2. “If so, which features and which points in the sample have caused this shift?” becomes “If so, which groups have caused it?”

We list the following existing methods which we attempt to bring under a common umbrella:

  1. (merrick2020explanation) unifies BShap/KernelSHAP/QII, noting that KernelSHAP (CES) and QII (RBShap) can be derived by taking the expectation of BShap over particular distributions, namely the input distribution for KernelSHAP and the joint marginal for QII. The approach in (vstrumbelj2014explaining) is equivalent to KernelSHAP (sundararajan2020many).

    Therefore, we can consider KernelSHAP and QII to be the following case of GroupShapley: the explicand is of shape 1 × n, broadcast to m × n, where m is the size of the background sample over which the expectation is calculated. The groups are the features and the drift function is the expected value difference.

  2. SAGE (covert2020understanding) is a global explanation method, where the aim is to attribute the loss of the model to the features, by suggesting that a feature whose removal increases the loss is more important. The loss is computed over a data sample of shape m × n. They propose using the conditional distribution as in CES in theory, but in practice use the marginal, as in RBShap. This is equivalent to GroupShapley with the features as groups, broadcasting the row dimension to the size of the background baseline sample drawn from the applicable distribution. The drift function is the expected value difference.

  3. Distributional Shapley (ghorbani2020distributional) aims to find the value of a data point, given a model and an evaluation metric. There is no inherent concept of a baseline here, though we could trivially add a set of random data as the baseline. We can design G so as to make its output for the baseline zero. The drift function is then the expected value difference between the accuracy on the explicand and the artificially created zero value accuracy of the baseline. We note that it may be more instructive to introduce the notion of a baseline here, so as to ground the value of a datum in more definite terms. For example, is the data from source A more informative than the data from source B?

  4. In (miroshnikov2020wasserstein), they propose using Shapley values to explain the Wasserstein-1 distance between two prediction samples, each belonging to a class of a protected attribute like Gender, Race and so on. This is directly analogous to our scheme.

6.5 Drift Group Integrated Gradients

We define Drift Group IG, or GroupIG, as being parametrized by the following choices:

  1. Choice of the explicand of shape m × n

  2. Choice of the baseline of the same shape as the explicand

  3. A model function f that is end-to-end differentiable with respect to the inputs

  4. Additional functions, the chain of which we call G, which return two real valued outputs of equal shape for the explicand and the baseline sample. G has to be differentiable in terms of the individual samples

  5. Choice of a distributional difference function that takes the two equal shaped outputs of G and returns a real valued output. Again, it has to be differentiable in terms of the original input samples

In GroupIG, we go from the baseline sample to the explicand in a straight line path. We can thus say that IG is a particular case of Drift Group IG, where m = 1, G is the identity function and the distributional difference function is the expected value difference. If we are using the Wasserstein-1 distance for a single input, we re-frame the function as the absolute distance between the prediction at the input and the baseline prediction.

7 Distributional Distance Metrics

We now discuss the properties of some of the widely used distance metrics for distances between two 1-D samples of data, each of length m.

7.1 Wasserstein-1 Distance

The Wasserstein-1 distance, also called the Earth Mover’s distance or Mallows distance, is a well known metric from optimal transport theory, and widely used in statistics and machine learning. The mathematical properties which aid its suitability for our task are discussed below, building on prior work (kolouri2018sliced; miroshnikov2020wasserstein; jiang2020wasserstein).

For the case of two 1-D samples of equal length, which is the case we are focusing on, the Wasserstein-p distance is the L_p norm of the difference between the sorted samples, scaled by the sample size. For samples A and B of length m with sorted values a_(1) ≤ ... ≤ a_(m) and b_(1) ≤ ... ≤ b_(m),

$$W_p(A, B) = \Bigl( \frac{1}{m} \sum_{i=1}^{m} \bigl| a_{(i)} - b_{(i)} \bigr|^{p} \Bigr)^{1/p}$$

The Wasserstein-1 distance is the special case where p is 1. Hence for p = 1, it reduces to the mean absolute difference between the sorted samples (levina2001earth).

The Wasserstein-1 distance for empirical samples satisfies the following distributional axioms: Sensitivity, Differentiability, Symmetry, and the Identity of Indiscernibles. (Proofs in Appendix)

7.2 Expected value difference

Expected value difference can be understood simply as the difference in the expected value of two distributions. This is a very intuitive concept, and is the simplest measurement of distributional difference, corresponding to a change in the first order moment. Given two samples A and B of length m, it is the difference in their means:

$$\frac{1}{m} \sum_{i=1}^{m} a_i - \frac{1}{m} \sum_{i=1}^{m} b_i$$

The expected value difference for empirical samples satisfies the following distributional axioms: Sensitivity, Differentiability, and Directionality, but not the Identity of Indiscernibles. (Proofs in Appendix)

7.3 Jensen Shannon Divergence

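The Jensen Shannon Divergence (JSD) given two probability distributions P and Q is defined as

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)$$

where M = (P + Q) / 2 and D_KL is the Kullback-Leibler divergence.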

While it is difficult to analyze JSD’s behavior given empirical samples, we can see that it does not satisfy Sensitivity and Directionality. (Proofs in Appendix)

7.4 Kolmogorov-Smirnov Test Statistic for Two Samples

This is actually a test to determine whether two empirical probability distributions differ; it yields a distance that is used as a measure of distributional difference.

The KS statistic distance is defined as

$$D_{KS} = \sup_{x} \bigl| F_A(x) - F_B(x) \bigr|$$

where F_A and F_B are the empirical Cumulative Distribution Functions (CDF) of the two samples A and B, and sup is the supremum.

The KS Statistic satisfies only the Symmetry and the Identity of Indiscernibles axiom. (miroshnikov2020wasserstein)
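The four drift functions discussed in this section can be sketched with NumPy alone. The function names are chosen for this sketch, the Wasserstein-1 form assumes equal-sized samples, and the histogram binning used for the Jensen Shannon Divergence is an assumption (the estimate depends on it).

```python
import numpy as np


def wasserstein_1(a, b):
    """Mean absolute difference of the sorted samples (equal-sized 1-D samples)."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def expected_value_difference(a, b):
    """Signed difference of the sample means."""
    return float(np.mean(a) - np.mean(b))


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs, checked at every observed point."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def jensen_shannon_divergence(a, b, bins=10):
    """JSD (natural log) between histograms of the samples on shared bins."""
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p = np.histogram(a, bins=edges)[0] / len(a)
    q = np.histogram(b, bins=edges)[0] / len(b)
    m = 0.5 * (p + q)

    def kl(u, v):
        mask = u > 0  # terms with u == 0 contribute nothing
        return np.sum(u[mask] * np.log(u[mask] / v[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 3.0, 4.0, 9.0])
b_far = np.array([2.0, 3.0, 4.0, 100.0])  # same as b, but the largest point moved

for name, fn in [("W1", wasserstein_1), ("EVD", expected_value_difference),
                 ("KS", ks_statistic), ("JSD", jensen_shannon_divergence)]:
    print(name, fn(a, b), fn(b, a), fn(a, b_far))
```

Moving a single extreme point changes the Wasserstein-1 distance and the expected value difference but leaves the KS statistic unchanged, while swapping the arguments flips the sign of the expected value difference only.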

Row  Explicand [x, y, z]  Baseline [x, y, z]  Exp. value Difference  Wasserstein-1 Distance  Shapley (Exp. value)  Shapley (Wasserstein-1)  IG (Exp. value)  IG (Wasserstein-1)
1    [1, 2, 3]            [0, 0, 0]            2.0                   2.0                     [1. 1. 0.]            [1. 1. 0.]               [1. 1. 0.]       [1. 1. 0.]
2    [1, 2, 3]            [0, 0, 0]           -1.0                   1.0                     [1. -2. 0.]           [0. 1. 0.]               [1. -2. 0.]      [-1. 2. 0.]
3    [1, 2, 3]            [0, 0, 0]            0.0                   0.0                     [1. 2. -3.]           [0. 0. 0.]               [1. 2. -3.]      [0. 0. 0.]
4    [1, 2, 3]            [0, 0, 0]           -7.0                   7.0                     [1. 1. -9.]           [-0.33 -0.33 7.67]       [1. 1. -9.]      [-1. -1. 9.]
5    [1, 2, 3]            [0, 0, 0]            1.0                   1.0                     [0.5 0.5 0.]          [0.5 0.5 0.]             [1. 0. 0.]       [1. 0. 0.]
6    [1, 2, 3]            [0, 0, 0]            1.0                   1.0                     [0. 1. 0.]            [0. 1. 0.]               [-1. 2. 0.]      [-1. 2. 0.]
Table 2: BShap and IG attributions for six functions of (x, y, z), one per row, using Expected Value Difference and Wasserstein-1. Note the sparser attributions using the Wasserstein distance.
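The Shapley columns of Table 2 can be checked by brute force. The function definitions are not legible in the table as reproduced here, so the six candidate functions below are guesses made for this sketch, chosen because exact BShap on them reproduces the tabulated Shapley columns under both drift functions. For a single explicand point, the expected value difference is the signed prediction gap and the Wasserstein-1 distance is its absolute value.

```python
import itertools
import math

import numpy as np


def bshap(f, x, baseline, drift_fn):
    """Exact Baseline Shapley of drift_fn(f(hybrid), f(baseline)) over features."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    n = x.size

    def payout(subset):
        idx = np.array(subset, dtype=int)
        hybrid = baseline.copy()
        hybrid[idx] = x[idx]  # present features come from the explicand
        return drift_fn(f(hybrid), f(baseline))

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (payout(subset + (i,)) - payout(subset))
    return phi


def signed_gap(p, q):
    """Expected value difference for a single prediction."""
    return p - q


def absolute_gap(p, q):
    """Wasserstein-1 distance for a single prediction."""
    return abs(p - q)


# (candidate function, tabulated Shapley with exp. value, tabulated Shapley with W1)
rows = [
    (lambda v: v[0] * v[1],             [1.0, 1.0, 0.0],  [1.0, 1.0, 0.0]),
    (lambda v: v[0] - v[1],             [1.0, -2.0, 0.0], [0.0, 1.0, 0.0]),
    (lambda v: v[0] + v[1] - v[2],      [1.0, 2.0, -3.0], [0.0, 0.0, 0.0]),
    (lambda v: v[0] * v[1] - v[2] ** 2, [1.0, 1.0, -9.0], [-0.33, -0.33, 7.67]),
    (lambda v: min(v[0], v[1]),         [0.5, 0.5, 0.0],  [0.5, 0.5, 0.0]),
    (lambda v: abs(v[0] - v[1]),        [0.0, 1.0, 0.0],  [0.0, 1.0, 0.0]),
]

explicand, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
results = [
    (bshap(f, explicand, baseline, signed_gap),
     bshap(f, explicand, baseline, absolute_gap))
    for f, _, _ in rows
]
for (signed, absolute), (_, want_signed, want_abs) in zip(results, rows):
    print(np.round(signed, 2), want_signed, "|", np.round(absolute, 2), want_abs)
```

Under these assumed functions, the second row shows the compression effect directly: the signed payout credits x and y with opposite signs, while the absolute payout gives all credit to y.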

8 The concept of Alignment

Given the need for a baseline in the Shapley value and IG formulations, it is natural to ask what the right baseline is, given that the attributions will differ with the choice of baseline. This is one of the most important questions in explainability. (sundararajan2017axiomatic) recommends choosing a baseline where the model’s prediction is neutral. (merrick2020explanation) argues for contrastive explanations, with justification from norm theory (kahneman1986norm).

In GroupShapley and GroupIG, when using the Wasserstein-1 drift function, we take the counterpart in the other sample as the baseline, when both samples are aligned by their sorted prediction values. The Wasserstein-1 distance is based on the concept of optimal transport, and hence the intuition extends naturally to the flow of the attributions, which make up the prediction, from one distribution to the other.

For other drift metrics, there may not be a natural reason to align in any particular way. But the alignment used for the Wasserstein-1 distance can still be justified as comparing the most similar points in the two samples, if the prediction of the model is viewed as a task specific dimensionality reduction. Fliptest (black2020fliptest) uses a similar thought process for assessing individual fairness by creating counterfactuals via optimal transport.

The alternative, where no choice needs to be made, is to take the expectation over all possible alignments.
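The alignment by sorted predictions can be sketched in a few lines. The helper name, the linear model and the data are assumptions of this sketch; it assumes two samples of equal size.

```python
import numpy as np


def align_by_prediction(model, explicand, baseline):
    """Reorder both samples by ascending prediction.

    After reordering, row i of the explicand faces row i of the baseline.
    """
    exp_order = np.argsort(model(explicand), kind="stable")
    base_order = np.argsort(model(baseline), kind="stable")
    return explicand[exp_order], baseline[base_order]


weights = np.array([1.0, 2.0])  # hypothetical linear model


def model(batch):
    return batch @ weights


explicand = np.array([[4.0, 1.0], [0.0, 1.0], [2.0, 3.0]])  # predictions 6, 2, 8
baseline = np.array([[1.0, 1.0], [3.0, 3.0], [0.0, 0.0]])   # predictions 3, 9, 0

aligned_exp, aligned_base = align_by_prediction(model, explicand, baseline)
# With this alignment, the Wasserstein-1 distance between the prediction
# samples is the mean absolute row-wise prediction gap.
w1 = float(np.mean(np.abs(model(aligned_exp) - model(aligned_base))))
print(aligned_exp, aligned_base, w1)
```

Each aligned baseline row then serves as the replacement value when the matching explicand row (or one of its features) is treated as missing.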

9 Analysis

We now look at some practical examples of how the choice of drift function impacts the explanations.

9.1 Simple Experiments

We analyze BShap and IG for a few functions in Table 2, using both the expected value difference and the Wasserstein-1 distance. These are functions of three variables x, y, and z; the baseline for all is [0, 0, 0], and the explicand is [1, 2, 3]. We see that the attributions can differ between the two drift functions, for both BShap and IG. It seems that the Wasserstein-1 drift function gives sparser attributions for BShap, by compressing the attributions for the features that act in the opposite direction to the eventual predicted value. For instance, in the second row of Table 2, x is 1 and y is 2, so the prediction is -1, and the Wasserstein-1 distance from the baseline prediction is 1. The method gives all the attribution to y, as it has the sign of the prediction. We can see this behavior in other rows of Table 2 as well. This is reminiscent of how the L1 norm sparsifies coefficients in lasso regression, but we make no claims of there being any analogy between the two.

There is no reason to always prefer the explanation of one over the other; both can be justified in their own way and are a matter of choice, similar to how choosing a baseline is a choice depending on the question one is looking to answer.

9.2 Case Study

We now present a simple case study, to demonstrate how this might work in practice, by constructing a synthetic dataset. This allows us to inject known and controlled drifts in order to evaluate the effectiveness of various methods at finding them.

We create a dataset of the following features:

  1. Location - {‘Springfield’, ‘Centerville’} - 70:30

  2. Education - {‘GRAD’,‘POST_GRAD’} - 80:20

  3. Experience - years - (0, 50) - normally distributed

  4. Engineer Type - {‘Software’,‘Hardware’} - 85:15

  5. Relevant Experience - years - (0, 50) - normally distributed

and ensure that experience ≥ relevant experience.

The model predicts an individual’s salary from the features above, using the following formula:

2000 events are created for each of three days. On the second day, a plausible data pipeline bug is introduced, whereby the location feature has the value “springfield” rather than “Springfield”. Because of this, all locations are identified as ‘Centerville’, which leads to an average salary drop for day two, i.e. a prediction drift. We now would like to attribute this to the offending features. Figure 1 shows the drift over time measured by the various drift methods previously discussed.

In Figure 2, we calculate GroupShapley attributions over the fifteen feature-day combinations, and see that the job location feature gets the most attribution, as we would expect.
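The setup of the case study can be imitated with a short simulation. Everything numeric below is hypothetical: the salary formula, the means and standard deviations of the normal draws and the seed are placeholders chosen for this sketch, since the paper's actual formula is not reproduced here. Only the category proportions, the event count and the lower-casing bug follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_day(n_events=2000):
    """Draw one day of synthetic applicants with the listed category proportions."""
    # Mean and spread of the normal draws are assumptions of this sketch.
    experience = np.clip(rng.normal(25.0, 8.0, n_events), 0.0, 50.0)
    relevant = np.clip(rng.normal(15.0, 8.0, n_events), 0.0, 50.0)
    return {
        "location": rng.choice(["Springfield", "Centerville"], n_events, p=[0.7, 0.3]),
        "education": rng.choice(["GRAD", "POST_GRAD"], n_events, p=[0.8, 0.2]),
        "experience": experience,
        "engineer_type": rng.choice(["Software", "Hardware"], n_events, p=[0.85, 0.15]),
        # Relevant experience can never exceed total experience.
        "relevant_experience": np.minimum(experience, relevant),
    }


def salary_model(day):
    """Hypothetical salary formula used only for this illustration."""
    return (
        50_000.0
        + 20_000.0 * (day["location"] == "Springfield")  # exact string match
        + 10_000.0 * (day["education"] == "POST_GRAD")
        + 5_000.0 * (day["engineer_type"] == "Software")
        + 500.0 * day["experience"]
        + 1_000.0 * day["relevant_experience"]
    )


days = [make_day() for _ in range(3)]
# Day-two pipeline bug: locations arrive lower-cased and never match "Springfield".
days[1]["location"] = np.char.lower(days[1]["location"])

predictions = [salary_model(day) for day in days]
# Expected value difference of each day against day one.
mean_drift = [float(p.mean() - predictions[0].mean()) for p in predictions]
print(mean_drift)
```

Day two shows a large negative mean drift while day three stays close to day one; feeding day one and day two into a group attribution routine with one group per feature would then be expected to single out the location feature.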

Additionally, we compare our approach to that of (pinto2019automatic), which measures drift using Jensen Shannon divergence and trains a Gradient Boosted Tree Classifier to identify the drift. The feature importances of the classifier are used to identify the cause. In the scenario described, it correctly gives the most attribution to the location feature. But if we introduce another spurious drift, of an unimportant feature like experience, the GBDT method selects the wrong feature. They do suggest a technique to remove time-trended features, but if the other feature also spikes in the same interval, that fix will not help either, as seen in Figure 3.


Figure 1: Comparison of four distributional drift functions. Wasserstein and means/expected value difference [upper] preserve the units of the predicted quantity and may provide a more intuitive scale. The Kolmogorov-Smirnov Statistic and Jensen Shannon Divergence [lower] are dimensionless and the scale reflects the degree of absolute distributional overlap. The central 20 periods have a data integrity error intentionally introduced which causes some applicants to have their location misinterpreted.
Figure 2: The drift function measures a drift of $7022.46 over the complete dataset. By forming groups of [features] × [days], GroupShapley unambiguously identifies the source of drift as the “Location” feature on the second day.
Figure 3: Comparison of tree-based feature importance method from (pinto2019automatic) and GroupShapley. Both methods initially identify the correct source of drift; but when an additional correlated feature drift is added, the GBDT method assigns it most of the importance, despite its minimal effect on the model output.

10 Conclusion and Future Work

We study the problem of measuring prediction drift and attributing it to input features and data points, and propose this as a general framework for explainability that unifies several existing methods. We axiomatize certain desirable properties of distributional difference metrics and demonstrate that explanation methods can be parameterized by the choice of this metric.

A more detailed study of the theoretical implications of choosing one distance metric over another for explanations is left for future work. Additionally, GroupShapley can be computationally expensive, and approximation schemes for faster calculations could be a future area of exploration.

11 Appendix

11.1 Axioms

We will now go over the reasons for the desirability of the axioms:

  1. Dummy - We do not want to credit a group/feature that makes no contribution to the model prediction.

  2. Efficiency - This ensures a complete accounting of difference in the model’s prediction between the explicand and the baseline.

  3. Linearity - This property helps in avoiding counter-intuitive behavior when analyzing attributions of linear functions.

  4. Symmetry - The purpose of this axiom is self-evident: if two groups contribute equally, they should receive the same attribution.

  5. Affine Scale Invariance - The justification for this is that the units of measurement of individual features may not be comparable to each other and, within a single feature, may not be canonical. For example, neither pounds nor kilograms is a more justified unit of weight than the other, and converting from one to the other should not lead to a decrease in attribution. (friedman1999three)

  6. Demand Monotonicity - For a function that is monotonic with respect to a group, if the group value increases while all else is held constant, the function’s value will increase. It is natural to want the attribution to the group to increase as compared to the previous scenario.

  7. Proportionality - This ensures that the attributions to groups are proportional to their contribution in the additive sum of the group values. Consider a heat generation scenario with three current sources, each supplying the same amount of current, where the heat generated is proportional to the square of the total current. Relative to the zero baseline, the attribution to each source should be one-third of the total. If we now combine two of the current sources into one, the attribution of the third should ideally remain the same.
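The heat generation example under Proportionality can be checked numerically. The sketch below assumes unit currents, a heat function equal to the square of the total current, and the plain Shapley value as the attribution rule; these are choices made for illustration. It only shows the behavior in this particular quadratic case and makes no claim about other functions.

```python
from itertools import combinations
from math import factorial

def shapley_heat(currents):
    """Exact Shapley values for heat = (total current)^2 against a zero baseline.

    `currents` maps a source name to the current it supplies when switched on.
    """
    names = list(currents)
    n = len(names)

    def heat(coalition):
        return sum(currents[c] for c in coalition) ** 2

    phi = {}
    for p in names:
        others = [q for q in names if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (heat(subset + (p,)) - heat(subset))
        phi[p] = total
    return phi

# Three equal sources: total heat is 3^2 = 9.
three_sources = shapley_heat({"a": 1.0, "b": 1.0, "c": 1.0})

# Sources a and b merged into one source supplying twice the current.
merged_sources = shapley_heat({"ab": 2.0, "c": 1.0})

print(three_sources)
print(merged_sources)
```

With three equal sources each receives one-third of the heat, and after merging two of them the third source keeps the same attribution while the merged source receives the remaining two-thirds.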

11.2 Proofs for Drift Metrics satisfying Axioms

Wasserstein-1 Distance

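Given two samples $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ of length $n$, each sorted by value, the distance can be computed as $W_1(X, Y) = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|$.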

The distance for empirical samples satisfies the following distributional axioms:


  1. Sensitivity - This is trivial to see, given that each point of the sample contributes to the overall sum.

  2. Differentiability - The function is piecewise differentiable; the only non-differentiable points are those where an absolute difference equals zero.

  3. Symmetry - The formula is symmetric in the two samples.

  4. Identity of Indiscernibles - The distance can be zero only if every pair of corresponding sorted elements in the two samples is equal, so that each absolute difference vanishes.
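The properties listed above can be spot-checked on small samples. This is a minimal sketch for equal-length samples only, using the mean absolute difference of the sorted values; the function name and the sample values are illustrative.

```python
def wasserstein_1(xs, ys):
    """Wasserstein-1 distance between two empirical samples of equal length."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal length")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

day_one = [3.0, 1.0, 2.0]
day_two = [4.0, 2.0, 3.0]

shifted = wasserstein_1(day_one, day_two)              # every point moved by 1
flipped = wasserstein_1(day_two, day_one)              # Symmetry
same = wasserstein_1(day_one, [1.0, 2.0, 3.0])         # identical multisets
one_point = wasserstein_1(day_one, [1.0, 2.0, 6.0])    # Sensitivity: one point moved
print(shifted, flipped, same, one_point)
```

Moving a single point changes the distance, swapping the arguments does not, and the distance is zero exactly when the sorted samples coincide.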

Expected value difference

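Given two samples $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ of length $n$, the Expected value distance is $\frac{1}{n} \sum_{i=1}^{n} x_i - \frac{1}{n} \sum_{i=1}^{n} y_i$.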


  1. Sensitivity - Each point of the sample contributes to the overall sum.

  2. Differentiability - One can see that the function is differentiable everywhere.

  3. Directionality - The sign changes when the sample order is flipped.

  4. Identity of Indiscernibles - This axiom is not satisfied, as a counterexample shows. Suppose one sample contains only the value 1 and the other contains an equal number of zeros and twos. The two means are equal and cancel out, so the distance is zero even though the two samples are not the same.
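The counterexample and the Directionality property can be reproduced in a few lines. The sign convention (first sample mean minus second sample mean) is an assumption of this sketch.

```python
def expected_value_difference(xs, ys):
    """Signed difference of sample means (sign convention assumed here)."""
    return sum(xs) / len(xs) - sum(ys) / len(ys)

ones = [1, 1, 1, 1]
zeros_and_twos = [0, 0, 2, 2]

# Different samples, equal means: the distance cannot tell them apart.
indiscernible = expected_value_difference(ones, zeros_and_twos)

# Directionality: swapping the samples flips the sign.
forward = expected_value_difference([1, 2], [3, 4])
backward = expected_value_difference([3, 4], [1, 2])
print(indiscernible, forward, backward)
```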

Jensen Shannon Divergence

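The Jensen Shannon Divergence (JSD) given two probability distributions P and Q is defined as $\mathrm{JSD}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$, where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.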

While it is difficult to analyze JSD’s behavior given empirical samples, we can see that it does not satisfy Sensitivity and Directionality.


  1. Sensitivity - This axiom is not satisfied, as a counterexample shows. If two distributions have no overlapping support, the JSD is 1 (with base-2 logarithms). If we now translate the second distribution while ensuring there is still no overlap, the JSD remains 1.

  2. Directionality - This axiom is not satisfied. JSD is symmetric, so its value does not change when the order of the two samples is swapped.
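Both failures can be seen on discrete histograms. The sketch below assumes base-2 logarithms and distributions given as probability vectors over the same bins; the bin layout and values are illustrative.

```python
from math import log2

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Three histograms over six bins with pairwise disjoint support.
p = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
q_near = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]
q_far = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]

near = jensen_shannon_divergence(p, q_near)
far = jensen_shannon_divergence(p, q_far)        # translated further, same value
swapped = jensen_shannon_divergence(q_near, p)   # argument order does not matter
print(near, far, swapped)
```

Translating the second histogram further away leaves the divergence saturated at 1, and swapping the arguments returns the same value.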

Kolmogorov-Smirnov Test Statistic for Two Samples

For two samples $X$ and $Y$, the KS statistic distance is $\sup_{t} |F_X(t) - F_Y(t)|$, where $F_X$ and $F_Y$ are the empirical Cumulative Distribution Functions (CDF) of $X$ and $Y$.

We can see from the definition that the KS Statistic satisfies the Symmetry and Identity of Indiscernibles axioms. For the other proofs, please refer to (miroshnikov2020wasserstein).
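As a quick illustration of the definition, the statistic can be computed directly from the two empirical CDFs. This is a minimal sketch; the function name and sample values are illustrative, and the supremum is evaluated only at the observed points, which is where the empirical CDFs jump.

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    points = sorted(set(xs_sorted) | set(ys_sorted))
    return max(
        abs(bisect_right(xs_sorted, t) / len(xs_sorted)
            - bisect_right(ys_sorted, t) / len(ys_sorted))
        for t in points
    )

sample_a = [1, 2, 3, 4]
sample_b = [3, 4, 5, 6]

gap = ks_statistic(sample_a, sample_b)
gap_swapped = ks_statistic(sample_b, sample_a)   # Symmetry
gap_same = ks_statistic(sample_a, [4, 3, 2, 1])  # identical samples
print(gap, gap_swapped, gap_same)
```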