1 Introduction
Fundamental machine learning theory is structured around the assumption that the training and test data are drawn from the same distribution. While this is a reasonable assumption for learning theory, in practice it is difficult to ascertain. In deployed models, the incoming stream of data may become significantly different from the static dataset that the model was trained on – a phenomenon named concept drift (gama2014survey; wang2015concept). Formally, concept or data drift is defined as the scenario where the distribution of the data, the label, or the concept changes relative to the training data that the model has seen. Previous drift detection methods have focused either on the overall error rate (gama2004learning) or on some other combination of the confusion matrix (wang2013concept), both of which require the prediction labels, which are not guaranteed for a machine learning model in production. Other work suggests using prediction drift as a proxy for concept drift in such cases (vzliobaite2010change). It signals a change in the model's prediction distribution, and this information may be of importance to the practitioner even if the model's accuracy is not impacted. For example, a lending company may have a quarterly target of loans to be disbursed, which may be achieved in a month if the prediction distribution of real-world applicants differs from the training data, thereby causing issues in business planning. A systematic method is thus needed for studying prediction drift and attributing it to a) the features of the model and b) the individual data points that constitute the distributional samples being compared. We frame the question as follows: Has the empirical distribution of inputs to the model drifted in a way that affects model behavior? If so, which features and which points in the sample have caused this shift? For this attribution to features and data, we focus on Shapley value based methods (vstrumbelj2014explaining; lundberg2017unified; sundararajan2020many; datta2016algorithmic). We include Integrated Gradients (sundararajan2017axiomatic) in this broad family, as it is equivalent to the Aumann–Shapley cost sharing method. Here, we adapt the Shapley framework, in the context of machine learning, to the following task: given two data samples of the same shape, and a function which computes some metric of distributional difference on the predictions made on the given datasets by a model, attribute the output of that function to each point of the target dataset, and to each feature. By using the Shapley framework, we automatically inherit the Shapley axioms, which have certain desirable properties that we discuss in Section 4.2.
Currently, there is no consensus on which of the many distributional difference metrics should be used for calculating prediction drift, with previous work using measures like the Jensen–Shannon divergence (pinto2019automatic), the Kolmogorov–Smirnov test (dos2016fast), or the Wasserstein-1 distance (miroshnikov2020wasserstein). A comparative analysis of these methods is presented in Section 7. We demonstrate an axiomatic framework to choose the most appropriate distributional distance metric depending on the use case.
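To make the stakes of this choice concrete, the following sketch (ours, using plain NumPy) shows two prediction samples on which two common metrics disagree: the expected value difference reports no drift, while the Wasserstein-1 distance does.

```python
import numpy as np

# Two prediction samples with identical means but different shapes.
a = np.array([0.5, 0.5, 0.5, 0.5])
b = np.array([0.0, 0.0, 1.0, 1.0])

# Expected value difference: first-order moments agree, so no drift.
ev_diff = b.mean() - a.mean()

# Wasserstein-1 for equal-length 1-D samples: mean absolute
# difference of the sorted samples.
w1 = np.mean(np.abs(np.sort(a) - np.sort(b)))

print(ev_diff, w1)  # 0.0 0.5
```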
Contributions
Our key contributions in this paper are:

Establishing an axiomatic framework for calculating and explaining prediction drift using Shapley values and IG

Extending the framework to explain arbitrary “groups” i.e. data and features together, thereby unifying several existing explanation methods

Applying the Shapley values formulation to a function of distributional difference

Axiomatization of measures of distributional difference

Empirical analysis of the implications of choosing a particular metric of distributional difference to measure prediction drift, over a few handcrafted examples
2 Related Work
Concept Drift
The problem of concept drift in machine learning has been extensively studied in the literature – spanning both sudden/instantaneous drift (sudden) and slow/gradual drift (stanley2003learning). Furthermore, the literature distinguishes between "true" concept drift and the "virtual" concept drift that happens due to a change in data distribution, essentially a sampling issue (salganicoff1997tolerating).
Popular methods in the literature for detecting concept drift are ADWIN (bifet2007learning) and the Page–Hinkley test (gama2013evaluating); both methods assume that labels are available for analysis, which is infeasible in a scenario where the model is deployed and constantly making predictions on new data.
Prediction Drift as a proxy
In the absence of instantly available labels, other methods resort to measuring the drift of the distribution of predictions as an ad hoc proxy for concept drift. Methods using this approach include work by pinto2019automatic, DBLP:journals/corr/abs190202808, dos2016fast and vzliobaite2010change. All these methods utilize different metrics to measure the difference between the distribution of predictions on new data points and a reference distribution.
Model explanations
Methods that describe the contribution of input features towards the final value of the prediction have recently gained considerable interest, both from researchers and practitioners. One class of these methods utilizes the Shapley value (shapley1953value), a popular concept in game theory, to measure the contribution of each feature. Another method, Integrated Gradients (sundararajan2017axiomatic), integrates gradients along the path between the input and a baseline to measure the attribution of each feature, and is a special case of the Aumann–Shapley cost sharing method. These methods have grown in popularity for quantifying the impact of features on a prediction at an instance level (vstrumbelj2014explaining; datta2016algorithmic; lundberg2017unified), on the loss at a global level (covert2020understanding), as well as for quantifying the contribution of individual data points to a model's performance (ghorbani2020distributional). They have also been proposed for understanding feature importance for measures of fairness (begley2020explainability; miroshnikov2020wasserstein).
3 Terminology
In this section, we lay out the terminology and notations that we use throughout the paper.
 Model function

– the machine learning model function f, which takes a vector of n features and returns one or more outputs. We limit the analysis to feature vectors instead of more general feature tensors, to avoid complications in notation. This does not mean, however, that the theory applies only to models that accept single-dimensional vectors; it can be extended quite easily.
Similarly, for the sake of simplicity and without loss of generality, for models which output a vector of values, we analyze only one output at a time. For example, classification models output a vector of length equal to the number of classes, of which there is a particular class of interest which we wish to analyze. Akin to machine learning models with batch predict, f is also able to accept a batch input of shape (m, n) and return m outputs.
 Sample (of points)

– in the context of a model, a sample of m feature vectors, each of shape (n,), the complete sample hence being of shape (m, n). The sample could be a single point (m = 1) or multiple points (m > 1). It could be chosen randomly from a distribution, or intentionally as per requirements, e.g. points corresponding to men over the age of 50 from New York, or the feature vector corresponding to ID "x" in a database of customers of an online retail store.
 Explicand

– the input sample of shape (m, n) for which we want to explain the predictions, with respect to a particular model function.
 Baseline

– a sample with the same dimensions as the explicand, against which the explicand is explained. All Shapley value based methods have a baseline, though it may not be obvious due to being implicit in the formulation (sundararajan2020many; lundberg2017unified). The explanation is dependent on the choice of baseline, and various papers (merrick2020explanation) have proposed certain choices of baselines, or ways to select one.
 Value Function

– the set function v : 2^N \to \mathbb{R} that is used in the Shapley value formulation to obtain the attribution of each player. Here N is the set of n features, and 2^N refers to all possible combinations of feature presence (or absence).
 Drift

– a measure of distributional difference, commonly used in context of time dependence, but we use it in a general sense.
 Distributional drift function

– a function D that, given two samples (as defined above), returns a value characterizing the difference between them. We restrict ourselves to analyzing distributional differences over 1-D samples.
 Groups

– combinations of the feature–datapoint components belonging to the explicand. These groups play the role of "players" in cooperative game theory for the purpose of Shapley and IG attributions of the drift value. Groups can be defined semantically, for example males and females, and can be formed as combinations along both the feature and datapoint dimensions. The Shapley value is calculated on the marginals of the resulting groups as players entering the coalition, over all such possible permutations.
4 Axioms
4.1 Axioms for attributions
From (sundararajan2020many; friedman1999three), we have the following desirable properties for attribution methods. In Section 6 we will formulate GroupShapley and GroupIG such that these properties are inherited. We state them here in terms of the group formulation for convenience. Reasons for their desirability are expanded on in the Appendix.

Dummy – a group that doesn't contribute to the game payout should get zero attribution.

Efficiency – the sum of the attributions over all groups is equal to the difference of the model function's output at the explicand and the baseline.

Linearity – the attributions of a linear combination of two model functions are the same linear combination of the attributions of the model functions, taken one at a time.

Symmetry – for model functions that are symmetric in two groups i and j, where the groups have the same values in both the explicand and the baseline, the attributions to both groups should be the same.

Affine Scale Invariance – requires the attributions to be invariant under the same affine transformation of both the model functions and the groups.

Demand Monotonicity – for a model function that is monotonic in a group, the attribution of the group should only increase if the value of the group increases.

Proportionality – if the model function can be expressed as an additive sum of the input groups, and the baseline is zero, the attributions to each group are proportional to the group value.
4.2 Axioms for distributional drift functions
miroshnikov2020wasserstein propose some desirable properties for a distributional drift function:

It should be continuous with respect to the change in the geometry of the distributions.

It should be non-invariant with respect to monotone transformations of the distributions.
Since our focus is on the distributional samples, and not the distributions themselves, we restate these properties for distributional drift measures for two 1D samples.

Sensitivity – the drift function should be continuous with respect to changes in the individual points in the samples. For example, given two 1-D samples A and B, if we change the value of any point in either, the function output should change.

Differentiability – the drift function should be differentiable with respect to the individual points in the samples; this is a stronger version of the Sensitivity axiom.

Symmetry – the drift function of two samples A and B should be symmetric, i.e. D(A, B) = D(B, A).

Identity of Indiscernibles – the drift is zero if and only if both samples are the same, i.e. D(A, B) = 0 if and only if A = B.

Directionality – the drift is signed based on the sample order. A metric cannot satisfy both Symmetry and Directionality unless it is always zero.
5 Prediction Drift
We define prediction drift as the change in the distribution of the predictions of a model between two semantically meaningful slices of data.
The need for studying prediction drift to answer the question raised above arises due to the following reasons:

Detecting drift in the distribution of individual features may not be sufficient. For instance, the predictions may drift despite no drift in any of the individual feature distributions, because the joint distribution of the features may have drifted.
Reference distribution            Target distribution
x    y    z    f(x, y, z)         x    y    z    f(x, y, z)
1.0  1.0  1.0  3.0                3.0  1.0  2.0  9.0
2.0  2.0  2.0  8.0                1.0  2.0  3.0  8.0
3.0  3.0  3.0  15.0               2.0  3.0  1.0  6.0
Table 1: The x, y and z distributions are unchanged at the univariate level, but the multivariate distribution has changed, and so has the prediction distribution f(x, y, z).
Furthermore, drift in individual features may not always lead to drift in predictions. This could, for instance, happen if the drifting feature is unimportant to the model.

Finally, detecting drift in the prediction distributions may not be sufficient either. While the prediction distributions may remain the same, it could still be that the input feature distributions have changed in a meaningful way that affects how the model reasons. Such a drift is still worth noting. For instance, the camera that feeds a face detection model could rotate over time due to hinge failure. A robust model will be able to handle the distortion of the image for a while before it fails. The prediction distribution will not change initially, but the feature attributions over the pixel regions will change, which can serve as an early warning system.
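The first failure mode – drift hidden in the joint distribution – is easy to reproduce in miniature. The interaction model f(x, y) = x·y below is our own stand-in, not the function used in Table 1:

```python
import numpy as np

# Hypothetical interaction model (a stand-in, not the paper's function).
def f(x, y):
    return x * y

ref = np.array([[1, 3], [3, 1]])  # reference sample, columns are x, y
tgt = np.array([[1, 1], [3, 3]])  # target sample

# The univariate feature distributions are identical...
assert sorted(ref[:, 0]) == sorted(tgt[:, 0])
assert sorted(ref[:, 1]) == sorted(tgt[:, 1])

# ...but the joint distribution differs, so the predictions drift.
pred_ref = f(ref[:, 0], ref[:, 1])  # [3, 3]
pred_tgt = f(tgt[:, 0], tgt[:, 1])  # [1, 9]
print(pred_ref.mean(), pred_tgt.mean())  # 3.0 5.0
```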
We focus our attention on problems 1 and 2, leaving 3 for future work. To answer the aforementioned question, we rely on the following steps:

Measure prediction drift for the model given two slices of data

Attribute the drift to meaningful groups in the data.
Possible meaningful groups could be features of the model, n-tile buckets of predictions, or rule-based slices such as males vs. females. We need to be careful to ensure that the number of observations in each slice is proportionally similar for each sample, to avoid statistical anomalies such as those seen in Simpson's paradox (simpson).
Practically, for calculating the prediction drift given two data samples of unequal and/or large size, we suggest a bootstrapping approach. We sample from the two empirical distributions for a given number of repetitions, calculate the expected value of the prediction drift and the attributions, and obtain statistical confidence bounds.
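A minimal sketch of this bootstrap (the function name and default parameters are ours):

```python
import numpy as np

def bootstrap_drift(preds_ref, preds_tgt, drift_fn, n_boot=1000, size=200, seed=0):
    """Resample both prediction samples to a common size, collect the
    drift statistic, and return its mean and a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    stats = np.array([
        drift_fn(rng.choice(preds_ref, size=size, replace=True),
                 rng.choice(preds_tgt, size=size, replace=True))
        for _ in range(n_boot)
    ])
    return stats.mean(), np.percentile(stats, [2.5, 97.5])

# Example: expected-value-difference drift between two prediction samples.
mean_drift, (lo, hi) = bootstrap_drift(
    np.zeros(500), np.ones(500), lambda a, b: b.mean() - a.mean(), n_boot=200
)
print(mean_drift, lo, hi)  # 1.0 1.0 1.0
```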
6 Group Shapley and Group IG Formulation
6.1 The Shapley value
Let the model function be f.
The Shapley value of a player i, playing an n-player coalitional game with a payout function v, is defined as

\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ v(S \cup \{i\}) - v(S) \right]  (1)
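For small n, this definition can be evaluated exactly by averaging marginal contributions over all orderings of the players; this equivalent permutation form is sketched below (the function names are ours):

```python
from itertools import permutations

def shapley_values(n, payout):
    """Exact Shapley values: average each player's marginal contribution
    over all n! orderings. `payout` maps a frozenset of players to a number."""
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            phi[player] += payout(grown) - payout(coalition)
            coalition = grown
    return [p / len(orders) for p in phi]

# For an additive game, the Shapley value recovers each player's weight.
weights = [1.0, 2.0, 3.0]
v = lambda S: sum(weights[i] for i in S)
print(shapley_values(3, v))  # [1.0, 2.0, 3.0]
```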
6.2 Baseline Shapley
Baseline Shapley (sundararajan2020many), or BShap, takes a function f, an explicand x and a baseline x'.
The value or payout function is v(S) = f(x_S ; x'_{N \setminus S}), i.e. f evaluated on a composite input that takes the features in S from the explicand and the remaining features from the baseline.
Here, the absence of a feature is modeled using the corresponding baseline value. BShap is equivalent to the Shapley–Shubik cost sharing method and satisfies the following axioms: Dummy, Linearity, Affine Scale Invariance, Demand Monotonicity, and Symmetry.
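A direct sketch of BShap with this value function, where absent features are taken from the baseline (exhaustive over orderings, so only practical for small n; the names are ours):

```python
import numpy as np
from itertools import permutations

def bshap(f, explicand, baseline):
    """Baseline Shapley: v(S) evaluates f on a composite input that takes
    features in S from the explicand and the rest from the baseline."""
    n = len(explicand)

    def payout(S):
        z = np.array([explicand[i] if i in S else baseline[i] for i in range(n)])
        return f(z)

    phi = np.zeros(n)
    orders = list(permutations(range(n)))
    for order in orders:
        S, prev = set(), payout(set())
        for i in order:
            S = S | {i}
            cur = payout(S)
            phi[i] += cur - prev
            prev = cur
    return phi / len(orders)

# Example: f(x) = x0 * x1 + x2 against a zero baseline.
attr = bshap(lambda x: x[0] * x[1] + x[2], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(attr)  # [1. 1. 3.]
```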
6.3 Integrated Gradients
The Integrated Gradients formulation is

IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha  (2)
Integrated Gradients is equivalent to the Aumann–Shapley cost sharing method for continuous functions.
Integrated Gradients satisfies the following axioms: Dummy, Linearity, Affine Scale Invariance, Proportionality, and Symmetry.
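The integral in equation (2) can be approximated with a Riemann sum along the straight-line path; the sketch below uses finite-difference gradients so it stays framework-free (the names and step counts are ours):

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=100, eps=1e-6):
    """Midpoint Riemann-sum approximation of IG along the straight-line
    path from `baseline` to `x`, with central-difference gradients."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    grad_sum = np.zeros_like(x)
    for a in alphas:
        point = baseline + a * (x - baseline)
        for i in range(len(x)):
            bump = np.zeros_like(x)
            bump[i] = eps
            grad_sum[i] += (f(point + bump) - f(point - bump)) / (2 * eps)
    return (x - baseline) * grad_sum / steps

# Example: f(v) = v0^2 + 3*v1; attributions sum to f(x) - f(baseline) = 7.
attr = integrated_gradients(lambda v: v[0] ** 2 + 3 * v[1], [2.0, 1.0], [0.0, 0.0])
print(attr)  # approximately [4. 3.]
```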
6.4 Drift Group Shapley
We define Drift Group Shapley, or GroupShapley, as being parametrized by the following choices:

Choice of the explicand of shape (m, n)

Choice of the baseline of the same shape as the explicand

A model function f

Additional functions, the chain of which we call G, which return two real-valued outputs of equal shape for both the explicand and the baseline sample

Choice of a distributional difference function D, which takes the two equal-shaped outputs of G and returns a real-valued output
The group formulation is:
v(S) = D\big( G(f(X_S)), \; G(f(X')) \big)  (3)

where X is the explicand, X' the baseline, and X_S the composite sample that takes the groups in S from the explicand and the remaining groups from the baseline; the Shapley value of each group is then computed with respect to this payout function as in (1).
In GroupShapley, we explain the drift between the output of the explicand and the baseline. The number of players is equal to the number of groups times the number of features. The number of groups is the number of subdivisions across rows. If the whole sample is one group, the features are the only players. If each row is its own group, we end up with (number of rows × number of features) groups to which we attribute the payout. To be precise, we are attributing the drift score to each group in the explicand, where a group is a cross-section consisting of at least one row and at most all m rows, and at least one feature and at most all n features.
To simulate the missingness of a player, we replace the group of interest with its aligned counterpart from the reference dataset, similar to the notion of the baseline in BShap or IG.
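A small sketch of this scheme (exhaustive over orderings, so exponential in the number of groups; all names are ours): players are groups of feature columns, and an absent group is filled in from the aligned baseline sample.

```python
import numpy as np
from itertools import permutations

def group_shapley_drift(model, explicand, baseline, groups, drift_fn):
    """Drift GroupShapley sketch. `groups` is a list of feature-index
    lists; an absent group takes its columns from the aligned baseline."""
    def payout(present):
        X = baseline.copy()
        for g in present:
            X[:, groups[g]] = explicand[:, groups[g]]
        # Drift of the hybrid sample's predictions vs. the baseline's.
        return drift_fn(model(X), model(baseline))

    k = len(groups)
    phi = np.zeros(k)
    orders = list(permutations(range(k)))
    for order in orders:
        present, prev = set(), payout(set())
        for g in order:
            present = present | {g}
            cur = payout(present)
            phi[g] += cur - prev
            prev = cur
    return phi / len(orders)

# Example: linear model, expected-value-difference drift, one group per feature.
model = lambda X: X.sum(axis=1)
drift = lambda p, q: p.mean() - q.mean()
expl = np.array([[1.0, 2.0], [3.0, 4.0]])
base = np.zeros((2, 2))
print(group_shapley_drift(model, expl, base, [[0], [1]], drift))  # [2. 3.]
```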
We now propose to frame every existing Shapley formulation as a prediction drift between some aspect of the model’s behavior at the explicand and the baseline. We reframe the two questions as:

Has the empirical distribution of inputs to the model drifted in a way that affects model behavior? becomes Is there a difference in groups between the explicand and the baseline that affects some aspect of model behavior?

If so, which features and which points in the sample have caused this shift? becomes If so, which groups have caused it?
We list the following existing methods which we attempt to bring under a common umbrella:

(merrick2020explanation) unifies BShap/KernelSHAP/QII, noting that KernelSHAP (CES) and QII (RBShap) can be derived by taking the expectation of BShap over particular distributions, namely the input distribution for KernelSHAP and the joint marginal for QII. The approach in (vstrumbelj2014explaining) is equivalent to KernelSHAP (sundararajan2020many).
Therefore, we can consider KernelSHAP and QII to be the following case of GroupShapley: the explicand is of shape (1, n), broadcast to (m, n), where m is the size of the background sample over which the expectation is calculated. The groups are the features and the drift function is the expected value difference.

SAGE (covert2020understanding) is a global explanation method, where the aim is to attribute the loss of the model to the features, by suggesting that a feature whose removal increases the loss is more important. The loss is computed over a data sample of shape (m, n). They propose using the conditional distribution as in CES in theory, but in practice use the marginal, as in RBShap. This is equivalent to GroupShapley with the features as groups, broadcasting the row dimension to the size of the background baseline sample from the applicable distribution. The drift function is the expected value difference.

Distributional Shapley (ghorbani2020distributional) aims to find the value of a data point, given a model and an evaluation metric. There is no inherent concept of a baseline here, though we could trivially add a set of random data as the baseline. We can design G so as to make the metric value of the baseline zero. The drift function is then the expected value difference between the accuracy on the explicand and the artificially created zero-value accuracy of the baseline. We note that it may be more instructive to introduce the notion of a baseline here, so as to ground the value of a datum in more definite terms: for example, is the data from source A more informative than that from source B?
(miroshnikov2020wasserstein) propose using Shapley values to explain the Wasserstein-1 distance between two prediction samples, each belonging to a class of a protected attribute like gender or race. This is directly analogous to our scheme.
6.5 Drift Group Integrated Gradients
We define Drift Group IG, or GroupIG, as being parametrized by the following choices:

Choice of the explicand of shape (m, n)

Choice of the baseline of the same shape as the explicand

A model function f that is end-to-end differentiable with respect to the inputs

Additional functions, the chain of which we call G, which return two real-valued outputs of equal shape for both the explicand and the baseline sample; G has to be differentiable in terms of the individual samples

Choice of a distributional difference function D, which takes the two equal-shaped outputs of G and returns a real-valued output. Again, D has to be differentiable in terms of the original input samples
In GroupIG, we go from the baseline sample to the explicand along a straight-line path. We can thus say that IG is a particular case of Drift Group IG, where m = 1, G is the identity function and the distributional difference function is the expected value difference. If we are using the Wasserstein-1 distance for a single input, we reframe the function as the absolute distance between the prediction at the input and the baseline prediction.
7 Distributional Distance Metrics
We now discuss the properties of some widely used metrics for distances between two 1-D samples A and B of length m.
7.1 Wasserstein1 Distance
The Wasserstein-1 distance, also called the Earth Mover's distance or Mallows distance, is a well-known metric from optimal transport theory, widely used in statistics and machine learning. The mathematical properties which aid its suitability for our task are discussed below, building on prior work (kolouri2018sliced; miroshnikov2020wasserstein; jiang2020wasserstein).
For the case of two 1-D samples, which is the case we focus on, the W_p distance is the l_p norm of the difference of the sorted samples; the Wasserstein-1 distance is the special case where p is 1:

W_1(A, B) = \frac{1}{m} \sum_{i=1}^{m} \left| A_{(i)} - B_{(i)} \right|,

where A_{(i)} and B_{(i)} denote the i-th order statistics of the two samples. Hence for p = 1, it reduces to the mean of the element-wise absolute differences between the sorted samples (levina2001earth).
The W_1 distance for empirical samples satisfies the following distributional axioms: Sensitivity, Differentiability, Symmetry, and the Identity of Indiscernibles. (Proofs in Appendix)
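The sorted-samples formula can be checked against a brute-force search over all one-to-one transport assignments, which is what makes the 1-D case so convenient (a small sketch; the names are ours):

```python
import numpy as np
from itertools import permutations

def w1_sorted(a, b):
    # Mean absolute difference of the sorted samples (equal lengths).
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def w1_bruteforce(a, b):
    # Minimum mean transport cost over all one-to-one assignments.
    return min(np.mean(np.abs(a - np.array(p))) for p in permutations(b))

a = np.array([0.0, 4.0, 2.0])
b = np.array([1.0, 3.0, 5.0])
print(w1_sorted(a, b), w1_bruteforce(a, b))  # 1.0 1.0
```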
7.2 Expected value difference
The expected value difference can be understood simply as the difference in the expected value of two distributions; given two samples, it is the difference of their means. This is a very intuitive concept, and is the simplest measurement of distributional difference, corresponding to a change in the first-order moment:

D_{EV}(A, B) = \frac{1}{m} \sum_{i=1}^{m} B_i - \frac{1}{m} \sum_{i=1}^{m} A_i.
The Expected value distance for empirical samples satisfies the following distributional axioms: Sensitivity, Differentiability, and Directionality but not the Identity of Indiscernibles. (Proofs in Appendix)
7.3 Jensen Shannon Divergence
The Jensen–Shannon Divergence (JSD), given two probability distributions P and Q, is defined as

JSD(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M),

where M = \frac{1}{2}(P + Q) and KL is the Kullback–Leibler divergence.
While it is difficult to analyze JSD’s behavior given empirical samples, we can see that it does not satisfy Sensitivity and Directionality. (Proofs in Appendix)
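The failure of Sensitivity is easy to see numerically: once two histograms have disjoint support, translating one of them further away leaves the JSD unchanged (a small sketch, ours; with log base 2 the JSD saturates at 1).

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = (p + q) / 2

    def kl(u, v):
        mask = u > 0
        return np.sum(u[mask] * np.log2(u[mask] / v[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.zeros(10); p[0] = 1.0    # all mass in bin 0
q1 = np.zeros(10); q1[5] = 1.0  # all mass in bin 5
q2 = np.zeros(10); q2[9] = 1.0  # mass translated further away
print(jsd(p, q1), jsd(p, q2))   # 1.0 1.0 - the extra shift is invisible
```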
7.4 KolmogorovSmirnov Test Statistic for Two Samples
This is a test to determine whether two empirical probability distributions differ, and it yields a statistic that is used as a measure of distributional difference (dos2016fast). The KS statistic is defined as

KS(A, B) = \sup_x \left| F_A(x) - F_B(x) \right|,

where F_A and F_B are the empirical cumulative distribution functions of A and B.
The KS statistic satisfies only the Symmetry and the Identity of Indiscernibles axioms. (miroshnikov2020wasserstein)
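A minimal implementation over empirical CDFs (ours; `scipy.stats.ks_2samp` provides the same statistic along with a p-value):

```python
import numpy as np

def ks_2samp_stat(a, b):
    """Two-sample KS statistic: the largest vertical gap between the two
    empirical CDFs, evaluated over the pooled sample."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    F_a = np.searchsorted(a, grid, side="right") / len(a)
    F_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(F_a - F_b))

print(ks_2samp_stat(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 0.0
print(ks_2samp_stat(np.array([1.0, 2.0]), np.array([5.0, 6.0])))  # 1.0
```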
Function(x, y, z) | Explicand [x, y, z] | Baseline [x, y, z] | Exp. value diff. | W1 distance | Shapley (EV) | Shapley (W1) | IG (EV) | IG (W1)
– | [1, 2, 3] | [0, 0, 0] | 2.0 | 2.0 | [1, 1, 0] | [1, 1, 0] | [1, 1, 0] | [1, 1, 0]
– | [1, 2, 3] | [0, 0, 0] | 1.0 | 1.0 | [1, 2, 0] | [0, 1, 0] | [1, 2, 0] | [1, 2, 0]
– | [1, 2, 3] | [0, 0, 0] | 0.0 | 0.0 | [1, 2, 3] | [0, 0, 0] | [1, 2, 3] | [0, 0, 0]
– | [1, 2, 3] | [0, 0, 0] | 7.0 | 7.0 | [1, 1, 9] | [0.33, 0.33, 7.67] | [1, 1, 9, 0] | [1, 1, 9]
– | [1, 2, 3] | [0, 0, 0] | 1.0 | 1.0 | [0.5, 0.5, 0] | [0.5, 0.5, 0] | [1, 0, 0] | [1, 0, 0]
– | [1, 2, 3] | [0, 0, 0] | 1.0 | 1.0 | [0, 1, 0] | [0, 1, 0] | [1, 2, 0] | [1, 2, 0]
Table 2: BShap and IG attributions for six example functions of (x, y, z), under the expected value difference and the W1 distance as drift functions.
8 The concept of Alignment
Given the need for a baseline in the Shapley value and IG formulations, it is natural to ask what the right baseline is, given that the attributions will differ with the choice of baseline. This is one of the most important questions in explainability. (sundararajan2017axiomatic) recommends choosing a baseline where the model's prediction is neutral. (merrick2020explanation) argues for contrastive explanations, with justification from norm theory (kahneman1986norm).
In GroupShapley and GroupIG, when using the W1 drift function, we take as baseline the counterpart in the other sample, with both samples aligned by their sorted prediction values. The W1 distance is based on the concept of optimal transport, and hence the intuition extends naturally to the flow of the attributions, which make up the prediction, from one distribution to the other.
For other drift metrics, there may not be a natural reason to align in any particular way. But the sorted alignment can still be justified as comparing the most similar points in the two samples, if the model's prediction is viewed as a task-specific dimensionality reduction. FlipTest (black2020fliptest) uses a similar thought process for assessing individual fairness by creating counterfactuals via optimal transport.
The alternative, where no choice needs to be made, is to take the expectation over all possible alignments.
9 Analysis
We now look at some practical examples of how the choice of drift function impacts the explanations.
9.1 Simple Experiments
We analyze BShap and IG for a few functions in Table 2, using both the expected value difference and the W1 distance. These are functions of three variables x, y, and z; the baseline for all is [0, 0, 0], and the explicand is [1, 2, 3]. We see that the attributions differ between the two drift functions, for both BShap and IG. The W1 drift function appears to give sparser attributions for BShap, by compressing the attributions for the features that act in the opposite direction to the eventual predicted value: it gives all the attribution to the feature that has the sign of the prediction. This is reminiscent of how the L1 penalty sparsifies coefficients in lasso regression, but we make no claims of there being any formal analogy between the two.
There is no reason to always prefer the explanation of one over the other; both can be justified in their own way, and the choice is a matter of preference, similar to how choosing a baseline depends on the question one is looking to answer.
9.2 Case Study
We now present a simple case study to demonstrate how this might work in practice, by constructing a synthetic dataset. This allows us to inject known and controlled drifts in order to evaluate the effectiveness of various methods at finding them.
We create a dataset of the following features:


Location  {‘Springfield’, ‘Centerville’}  70:30

Education  {‘GRAD’,‘POST_GRAD’}  80:20

Experience  years  (0, 50)  normally distributed

Engineer Type  {‘Software’,‘Hardware’}  85:15

Relevant Experience  years  (0, 50)  normally distributed
and ensure that experience ≥ relevant experience.
The model predicts an individual’s salary from the features above, using the following formula:
2000 events are created for each of three days. On the second day, a plausible data pipeline bug is introduced, whereby the location feature has the value "springfield" rather than "Springfield". Because of this, all locations are identified as 'Centerville', which leads to an average salary drop for day two – a prediction drift. We would now like to attribute this to the offending features. Figure 1 shows the drift over time as measured by the various drift methods previously discussed.
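The day-two bug is easy to reproduce in miniature. The salary formula below is our own hypothetical stand-in (the coefficients are not the ones used in the case study), but the mechanism is the same: the lowercase value never matches "Springfield", so the location premium silently disappears.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in for the salary formula (coefficients are ours).
def salary(location, experience):
    base = 50_000 + 2_000 * experience
    return base + np.where(location == "Springfield", 20_000, 0)

n = 2000
experience = rng.uniform(0, 50, n)
loc_day1 = rng.choice(["Springfield", "Centerville"], size=n, p=[0.7, 0.3])
loc_day2 = np.char.lower(loc_day1.astype(str))  # the day-two pipeline bug

drop = salary(loc_day1, experience).mean() - salary(loc_day2, experience).mean()
print(drop)  # roughly 0.7 * 20,000 = 14,000 on average
```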
In Figure 2, we calculate GroupShapley attributions over the fifteen feature–day combinations, and see that the location feature gets the most attribution, as we would expect.
Additionally, we compare our approach to that of (pinto2019automatic), which measures drift using the Jensen–Shannon divergence and trains a Gradient Boosted Tree classifier to identify the drift; the feature importances of the classifier are used to identify the cause. In the scenario described, it correctly gives the most attribution to the location feature. But if we introduce another spurious drift, in an unimportant feature like experience, the GBDT method selects the wrong feature. They do suggest a technique to remove time-trended features, but if the other feature also spikes in the same interval, that fix will not help either, as seen in Figure 3.
10 Conclusion and Future Work
We study the problem of prediction drift and its attribution, and propose it as a general framework of explainability, unifying several methods. We axiomatize certain desirable properties of distributional difference metrics, also demonstrating that explanation methods can be parameterized by the choice of this metric.
A more detailed study of the theoretical implications of choosing one distance metric over another for explanations is left for future work. Additionally, GroupShapley can be computationally expensive, and approximation schemes for faster calculations could be a future area of exploration.
11 Appendix
11.1 Axioms
We will now go over the reasons for the desirability of the axioms:

Dummy  We do not want to credit a group/feature that makes no contribution to the model prediction.

Efficiency  This ensures a complete accounting of difference in the model’s prediction between the explicand and the baseline.

Linearity  This property helps in avoiding counterintuitive behavior when analyzing attributions of linear functions.

Symmetry  The purpose of this axiom is selfevident, if two groups contribute equally they should receive the same attribution.

Affine Scale Invariance  The justification for this is based on the idea that the units of measurement of individual features may not be comparable to each other, and secondly, within themselves, may not be canonical. For example, units of weight like pounds or kilograms are not more or less justified than the other, and the conversion to the other should not lead to a decrease in attribution. (friedman1999three)

Demand Monotonicity  For a function that is monotonic with respect to a group, if the group value increases while all else is held constant, the function’s value will increase. It is natural to want the attribution to the group to increase as compared to the previous scenario.

Proportionality – This ensures that the attributions to groups are proportional to their contribution in the additive sum of the group values. Consider a heat-generation scenario with three current sources, each supplying the same amount of current, where the heat generated is proportional to the square of the total current. The attribution to each source should be one-third, compared to the zero baseline. Now, if we combine two of the current sources into one player, the attribution of the third should ideally remain the same.
11.2 Proofs for Drift Metrics satisfying Axioms
Wasserstein-1 Distance
Given two samples A and B of length m, sorted by value, the W_1 distance can be computed as W_1(A, B) = \frac{1}{m} \sum_{i=1}^{m} |A_{(i)} - B_{(i)}|.
The W_1 distance for empirical samples satisfies the following distributional axioms:
Proofs:

Sensitivity  This is trivial to see, given that each point of the sample contributes to the overall sum.

Differentiability  The function is piecewise differentiable, except at zero for each absolute difference.

Symmetry – The formula is symmetric in A and B.

Identity of Indiscernibles – Since each term |A_{(i)} - B_{(i)}| is non-negative, the distance can be zero only if every pair of corresponding sorted elements is equal, i.e. the two samples are the same.
Expected value difference
Given two samples A and B of length m, the expected value distance is D_{EV}(A, B) = \bar{B} - \bar{A}.
Proofs:

Sensitivity  Each point of the sample contributes to the overall sum.

Differentiability  One can see that the function is differentiable everywhere.

Directionality  The sign changes when the sample order is flipped.

Identity of Indiscernibles – This can be disproved by a counterexample. Take one sample containing only the value 1, and another with an equal number of zeros and twos: the two means are equal, so the difference is zero even though the two samples are not the same.
Jensen Shannon Divergence
The Jensen–Shannon Divergence (JSD), given two probability distributions P and Q, is defined as JSD(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M), where M = \frac{1}{2}(P + Q) and KL is the Kullback–Leibler divergence.
While it is difficult to analyze JSD’s behavior given empirical samples, we can see that it does not satisfy Sensitivity and Directionality.
Proofs:

Sensitivity – This can be disproved by a counterexample. If two distributions have disjoint supports, the JSD is 1 (using log base 2). If we translate the second distribution while keeping the supports disjoint, the JSD is still 1, even though the samples have changed.

Directionality  JSD is symmetric to the change in the sample order.
KolmogorovSmirnov Test Statistic for Two Samples
For two samples A and B, the KS statistic is KS(A, B) = \sup_x |F_A(x) - F_B(x)|, where F_A and F_B are the empirical Cumulative Distribution Functions (CDFs) of A and B.
We can see from the definition that the KS statistic satisfies Symmetry and the Identity of Indiscernibles. For the other proofs, please refer to (miroshnikov2020wasserstein).