Who's responsible? Jointly quantifying the contribution of the learning algorithm and training data

10/09/2019 ∙ by Gal Yona, et al.

A fancy learning algorithm A outperforms a baseline method B when they are both trained on the same data. Should A get all of the credit for the improved performance or does the training data also deserve some credit? When deployed in a new setting from a different domain, however, A makes more mistakes than B. How much of the blame should go to the learning algorithm or the training data? Such questions are becoming increasingly important and prevalent as we aim to make ML more accountable. Their answers would also help us allocate resources between algorithm design and data collection. In this paper, we formalize these questions and provide a principled Extended Shapley framework to jointly quantify the contribution of the learning algorithm and training data. Extended Shapley uniquely satisfies several natural properties that ensure equitable treatment of data and algorithm. Through experiments and theoretical analysis, we demonstrate that Extended Shapley has several important applications: 1) it provides a new metric of ML performance improvement that disentangles the influence of the data regime and the algorithm; 2) it facilitates ML accountability by properly assigning responsibility for mistakes; 3) it provides more robustness to manipulation by the ML designer.




1 Introduction

In machine learning (ML), the standard way to evaluate a new learning algorithm $A$ is to compare its performance with the performance of a baseline algorithm $B$, when $A$ and $B$ are trained on the same dataset $N$. For example, if $A$ and $B$ achieve 0.9 and 0.7 accuracy, then papers typically report that $A$ is better than $B$ by 0.2. Implicit in this ubiquitous practice is the assumption that $A$ itself is solely responsible for all of the difference in performance. Is this always a reasonable assumption? Could the training data also deserve some of the credit for the improvement?

Taking this example one step further, suppose $A$ and $B$ are deployed in a new setting, which may not be identical to the training distribution, and the new accuracies drop to 0.5 and 0.7 respectively. Is the learning algorithm $A$ now entirely to blame for the 0.2 performance gap? Perhaps some of the responsibility lies with the training data as well.

How to quantify and assign credit for learning algorithms is a foundational component of making ML more accountable, fair and transparent. This question is increasingly important as learning algorithms become more widespread and more regulated. However, it has not been rigorously studied. In this paper, we develop a principled Extended Shapley framework to jointly model and quantify the contributions of the training data and the learning algorithm. Extended Shapley uniquely satisfies several desirable fairness properties. We derive analytical characterizations of the value of the ML designer and the value of each individual training datum. We also demonstrate how this framework naturally addresses the questions posed above: the value of the ML designer in Extended Shapley quantifies how much of the improvement or drop in performance (compared to a baseline) the designer is responsible for. This provides a new metric to assess the progress made by new learning algorithms, and it also gives insights into how an algorithm’s value depends on the data distribution.

Related work

Shapley value was proposed in a classic paper in game theory [24]. It has been applied to analyze and model diverse problems including voting and bargaining [20, 10]. Recent works have adopted Shapley values to quantify the contribution of data in ML tasks [9, 12, 1]. However, in these settings the ML algorithm is assumed to be given and the focus is purely on the data value. In contrast, in our Extended Shapley framework, we jointly model the contribution of both the learning algorithm and the data. Shapley value has also been used in a very different ML context as a feature importance score to interpret black-box predictive models [21, 14, 6, 17, 5, 8, 4, 16]. Their goal is to quantify, for a given prediction, which features are the most influential for the model output. There is also a literature on estimating Shapley value using Monte Carlo methods, network approximations, as well as analytically solving Shapley value in specialized settings [7, 19, 3, 18, 11].

Algorithmic accountability is an important step towards making ML more fair and transparent [28]. However there has been relatively little work on quantitative metrics of accountability. To the best of our knowledge, this is the first work that jointly quantifies the contribution of training data and the learning algorithm in ML.

2 Jointly modeling the contribution of data and algorithm

2.1 Data valuation

Recent works studied how to quantify the value of each individual training datum for a fixed ML model [9]. In the data valuation setting, we are given a training set $N = \{z_1, \dots, z_n\}$ of $n$ data points (we will often denote $z_i$ by its index $i$ to simplify notation) and we have a fixed learning algorithm $A$. $A$ can be trained on any subset $S \subseteq N$ and produces a predictor $A(S)$. The performance of $A(S)$ is quantified by a particular metric (e.g. accuracy) evaluated on the test data, and we denote this as $v_A(S)$. More formally, let $G(N) = \{v : 2^N \to \mathbb{R} \text{ such that } v(\emptyset) = 0\}$ denote the space of performance functions, with $v_A \in G(N)$. The algorithm’s overall performance is $v_A(N)$ and the goal of data valuation is to partition $v_A(N)$ among the points $i \in N$.

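By drawing an analogy to the classic Shapley value in cooperative game theory [24, 23], it turns out that there is a unique data valuation scheme, $\varphi : G(N) \to \mathbb{R}^n$, that satisfies four reasonable equitability principles:

  1. Null player: If $i$ is a null player in $v$, i.e. $v(S \cup \{i\}) = v(S)$ for all $S \subseteq N \setminus \{i\}$, then $\varphi_i(v) = 0$.

  2. Linearity: If $v_1, v_2 \in G(N)$ are two performance functions, then $\varphi(v_1 + v_2) = \varphi(v_1) + \varphi(v_2)$.

  3. Efficiency: $\sum_{i \in N} \varphi_i(v) = v(N)$.

  4. Symmetry: If $i, j$ are identical in $v$, i.e. $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq N \setminus \{i, j\}$, then $\varphi_i(v) = \varphi_j(v)$.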

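This unique data Shapley value is given by:

$$\varphi_i(v) = \sum_{S \subseteq N \setminus \{i\}} w(S)\,\bigl[v(S \cup \{i\}) - v(S)\bigr], \qquad w(S) = \frac{|S|!\,(n-|S|-1)!}{n!} \tag{1}$$

where $\varphi_i(v)$ denotes the value of the $i$-th training point and $w(S)$ is a weighting term [9].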
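As a concrete illustration, the following is a minimal sketch of an exact Data Shapley computation by brute-force enumeration. It assumes the standard weighting $w(S) = |S|!\,(n-|S|-1)!/n!$ from [9]; the function name `data_shapley` and the toy performance function `toy_v` are hypothetical and chosen only for this example. Enumeration is exponential in $n$, so this is only practical for very small training sets.

```python
from itertools import combinations
from math import factorial


def data_shapley(n, v):
    """Exact Data Shapley values for points 0..n-1.

    v maps a frozenset of point indices to a performance score, with v(empty) = 0.
    """
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):  # k = |S|, subsets of the other n-1 points
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                S = frozenset(subset)
                phi += w * (v(S | {i}) - v(S))  # weighted marginal contribution
        values.append(phi)
    return values


def toy_v(S):
    """Hypothetical performance: points 0 and 1 help, point 2 is a null player."""
    return 0.3 * (0 in S) + 0.3 * (1 in S) + 0.2 * (0 in S and 1 in S)


phi = data_shapley(3, toy_v)
print(phi)
```

In this toy game the null player (point 2) receives zero, the two symmetric points receive equal values, and the values sum to the full-data performance, matching the four principles listed above.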

2.2 Extended Shapley for data and learning algorithm

While the data Shapley of Eqn. 1 is useful (e.g. it identifies poor-quality data and can improve active learning [9]), it implicitly assumes that all of the performance is due to the training data, since $\sum_{i \in N} \varphi_i = v_A(N)$. This data-centric view is limiting, since we know that a good design of the learning algorithm can greatly improve performance, and hence deserves some credit. Conversely, if $A$ fails on a test domain, then $A$ should share some of the responsibility.

Much of ML takes the other extreme, the algorithm-centric perspective, in assigning credit and responsibility. Many ML papers present a new learning algorithm $A$ by showing a performance gap over a baseline algorithm $B$ on a standard dataset $N$. It is often framed or implicitly assumed that the entire performance gap is due to the higher quality of $A$ over $B$. As we will discuss, this algorithm-centric perspective is also limiting, since it completely ignores the contribution of the training data in $N$ to this performance gap.

In many real-world applications, we would like to jointly model the value of the data and the learning algorithm. Training data takes time and resources to collect and label. Developing an advanced algorithm tailored for a particular dataset or task also requires time and resources, and it can lead to a better outcome than just using an off-the-shelf method (e.g. a carefully architected deep net vs. running a scikit-learn SVM [22]). Moreover, in many applications, the algorithm is adaptively designed based on the data. Therefore it makes sense to quantify the contribution of the data and algorithm together.

First attempt at model

To gain intuition as we build up to our proposed solution, consider the following naive extension of Data Shapley to this new setup: explicitly add the algorithms $A$ and $B$ to the coalition as the $(n+1)$-th and $(n+2)$-th synthetic “data” points. Then use Data Shapley as above (Equation 1) with respect to the enlarged coalition $N \cup \{A, B\}$ and the following modified value function for $S \subseteq N \cup \{A, B\}$:


where $v_A$ and $v_B$ are the performance of $A$ and $B$, respectively, trained on the data points in $S$. However, this model doesn’t capture our intuition that $A$ is the new algorithm that we want to evaluate and $B$ is the off-the-shelf benchmark. For example, suppose the ML developer is lazy and provides a fancy algorithm $A$ which is exactly the same as $B$ under the hood. Then $A$ and $B$ would be completely symmetric in Eqn. 2, and they would receive the same value (which could be substantial) under Eqn. 1. In this case, the ML developer would receive credit for doing absolutely no work, which is undesirable.

Extended Shapley

We now formally define our Extended Shapley framework. We will build on the notation set up in Section 2.1. Suppose we have an ML designer who develops algorithm $A$, and $B$ denotes a baseline ML model. For example, $A$ could be a new tailor-made network and $B$ the off-the-shelf SVM. Both $A$ and $B$ are trained on the dataset $N$. Our goal is to jointly quantify the value of each datum in $N$ as well as the ML designer who proposes $A$. Including the baseline makes the framework more flexible and captures more realistic settings.

Let $v_A, v_B \in G(N)$ denote the performances of $A$ and $B$, respectively. Denote . We are interested in assigning value to both the datapoints and the ML designer for developing $A$ instead of $B$. That is, we are now looking for an extended valuation function, $\varphi : G(N) \times G(N) \to \mathbb{R}^{n+1}$. Note that $\varphi$ is defined over pairs of games, and that its range is in $\mathbb{R}^{n+1}$. We interpret $\varphi_i$ for $i \in N$ as the value for a datapoint and $\varphi_{n+1}$, sometimes denoted $\varphi_A$, as the payoff for the ML designer. The setting is intrinsically asymmetric between $A$ and $B$, and we are not interested in the value of the baseline. The inclusion of the baseline gives us more modeling flexibility. For example, if we are interested in the value of $A$ without comparing it to a baseline, we can always set $v_B \equiv 0$.

There are infinitely many possible valuation functions . Following the approach of the original Shapley value, we take an axiomatic approach by first laying out a set of reasonable properties that we would like an equitable valuation to satisfy, and then analyze the resulting value.

3 Extended Shapley

We would like extended Shapley valuation to satisfy the following properties, which are natural extensions of the original Shapley axioms.

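  1. Extended Null Player: If $i \in N$ is a null player in both $v_A, v_B$, then it should receive $\varphi_i = 0$. Additionally, if $A$ is identical to the benchmark $B$, i.e. $v_A(S) = v_B(S)$ for all $S \subseteq N$, the designer should receive $\varphi_A = 0$.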

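  2. Linearity: Let $v_A, v_B, v'_A, v'_B$ be any four performance functions. Then, $\varphi_i(v_A + v'_A, v_B + v'_B) = \varphi_i(v_A, v_B) + \varphi_i(v'_A, v'_B)$ for every $i \in N$; similarly, $\varphi_A(v_A + v'_A, v_B + v'_B) = \varphi_A(v_A, v_B) + \varphi_A(v'_A, v'_B)$.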

  3. Efficiency w.r.t. $v_A$: $\sum_{i \in N} \varphi_i(v_A, v_B) + \varphi_A(v_A, v_B) = v_A(N)$.

  4. Symmetry between data: If $i, j \in N$ are identical in both $v_A, v_B$, then $\varphi_i = \varphi_j$.

  5. Equitability between data and algorithm: if adding $A$ and adding $i$ have the same effect, i.e. $v_A(S) = v_B(S \cup \{i\})$ for all $S \subseteq N \setminus \{i\}$, then $\varphi_A = \varphi_i$.

P1 to P4 are direct analogues of the fundamental fairness axioms of the Shapley value and the data Shapley value. P5 is also a reasonable property that ensures that the algorithm and data get the same value if they have the same effect. It might be helpful to think of the extreme case where $A$ simply adds another datum to the training set and then applies $B$. In this case, P5 is equivalent to P4.

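There is a unique valuation $\varphi$ that satisfies P1 to P5, and it is given by

$$\varphi_A(v_A, v_B) = \frac{1}{n+1} \sum_{S \subseteq N} \binom{n}{|S|}^{-1} \bigl[v_A(S) - v_B(S)\bigr] \tag{3}$$

$$\varphi_i(v_A, v_B) = \sum_{S \subseteq N \setminus \{i\}} w(S)\,\frac{|S|+1}{n+1}\,\bigl[v_A(S \cup \{i\}) - v_A(S)\bigr] + \sum_{S \subseteq N \setminus \{i\}} w(S)\,\frac{n-|S|}{n+1}\,\bigl[v_B(S \cup \{i\}) - v_B(S)\bigr] \tag{4}$$

where $w(S) = \frac{|S|!\,(n-|S|-1)!}{n!}$ is the Data Shapley weight. We call $\varphi$ the Extended Shapley values.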


Please see Appendix A for the proof. ∎
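The following is a minimal sketch of an exact Extended Shapley computation on a toy problem. It assumes that the designer's value is $\varphi_A = \frac{1}{n+1}\sum_{S \subseteq N}\binom{n}{|S|}^{-1}[v_A(S) - v_B(S)]$ and that each datum's value is the Data Shapley sum with the marginal contributions under $v_A$ and $v_B$ reweighted by $\frac{|S|+1}{n+1}$ and $\frac{n-|S|}{n+1}$, respectively. The function `extended_shapley` and the toy performance functions are hypothetical names introduced for illustration only.

```python
from itertools import combinations
from math import comb, factorial


def extended_shapley(n, v_a, v_b):
    """Return (designer_value, [value of point 0, ..., value of point n-1]).

    v_a and v_b map a frozenset of point indices to a performance score.
    """
    points = range(n)

    # Designer: size-balanced average of the gap v_a(S) - v_b(S).
    phi_designer = 0.0
    for k in range(n + 1):
        for subset in combinations(points, k):
            S = frozenset(subset)
            phi_designer += (v_a(S) - v_b(S)) / ((n + 1) * comb(n, k))

    # Data points: reweighted marginal contributions under v_a and v_b.
    phi_data = []
    for i in points:
        others = [j for j in points if j != i]
        phi = 0.0
        for k in range(n):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                S = frozenset(subset)
                phi += w * (k + 1) / (n + 1) * (v_a(S | {i}) - v_a(S))
                phi += w * (n - k) / (n + 1) * (v_b(S | {i}) - v_b(S))
        phi_data.append(phi)
    return phi_designer, phi_data


def v_a(S):
    """Hypothetical new algorithm: accuracy grows with the training size."""
    return min(0.9, 0.3 * len(S))


def v_b(S):
    """Hypothetical baseline: constant 0.5 accuracy on any nonempty set."""
    return 0.5 if S else 0.0


phi_designer, phi_data = extended_shapley(3, v_a, v_b)
lazy_designer, _ = extended_shapley(3, v_b, v_b)  # A identical to B
print(phi_designer, phi_data, lazy_designer)
```

On this toy input the designer's value plus the data values add up to $v_A(N)$, and a designer who resubmits the baseline unchanged receives zero, which is the behavior the naive extension failed to provide.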

Each of the properties P1 to P5 is necessary to ensure the uniqueness of the valuation $\varphi$. In Eqn. 3, the coefficient of each term is $\frac{1}{n+1}\binom{n}{|S|}^{-1}$. Thus intuitively, instead of simply comparing the difference in the two algorithms’ overall performance $v_A(N) - v_B(N)$, Extended Shapley considers the expected difference $v_A(S) - v_B(S)$, where $S$ is chosen randomly as follows: first a set size is chosen uniformly at random from 0 to $n$, and then a random subset of that size is chosen.

In Eqn. 4, if it weren’t for the additional weight terms $\frac{|S|+1}{n+1}$ and $\frac{n-|S|}{n+1}$, the term on the left would be identical to the payoff of $i$ under Data Shapley w.r.t. $v_A$ and the term on the right would be the payoff of $i$ under Data Shapley w.r.t. $v_B$. Thus, one way of interpreting these two quantities is as variants of the Shapley payoff that adjust the importance of subsets according to their size: since $\frac{|S|+1}{n+1}$ grows with $|S|$ while $\frac{n-|S|}{n+1}$ shrinks, the expression on the left down-weights the marginal contribution of $i$ to smaller subsets, whereas the expression on the right down-weights the marginal contribution of $i$ to larger subsets.

We can provide a different view of the ML designer’s value under Extended Shapley, that is related to the notion of leave-one-out stability.

[] For a game , define as follows:

The value of the ML designer under Extended Shapley can also be written as

In some cases, the view of $A$’s value in Lemma 3 can provide an upper bound on $\varphi_A$ without directly computing the expression in Eqn. 3.

[] If are -leave-one-out-stable w.r.t the performance metric , i.e., for both and

Then .

The proof can be found in Appendix A. The following example demonstrates that the upper bound is tight.

Suppose that and are given by and . In this case, the value of under Extended Shapley is exactly :

Note that this is exactly the upper-bound from Lemma 3, since is stable and is stable.

In general, while computing the Extended Shapley values in Eqns. 3 and 4 exactly is expensive, we can efficiently estimate them with Monte Carlo approximations [9].
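As a rough sketch of what such an approximation can look like for the designer's value, the code below uses plain Monte Carlo sampling (not the TMC-Shapley algorithm of [9]): it repeatedly draws a subset size uniformly from 0 to $n$, then a uniformly random subset of that size, and averages the performance gap. It assumes the designer's value has the form $\frac{1}{n+1}\sum_{S \subseteq N}\binom{n}{|S|}^{-1}[v_A(S) - v_B(S)]$; all function names and the toy performance functions are hypothetical.

```python
import random
from itertools import combinations
from math import comb


def designer_value_mc(n, v_a, v_b, num_samples=20000, seed=0):
    """Monte Carlo estimate of the designer's Extended Shapley value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        k = rng.randint(0, n)                    # size uniform on {0, ..., n}
        S = frozenset(rng.sample(range(n), k))   # uniform subset of that size
        total += v_a(S) - v_b(S)
    return total / num_samples


def designer_value_exact(n, v_a, v_b):
    """Exact value by enumerating all subsets (only feasible for small n)."""
    total = 0.0
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            S = frozenset(subset)
            total += (v_a(S) - v_b(S)) / ((n + 1) * comb(n, k))
    return total


def v_a(S):
    """Hypothetical new algorithm; point 0 is slightly more informative."""
    return min(0.9, 0.15 * len(S) + 0.1 * (0 in S))


def v_b(S):
    """Hypothetical baseline: constant 0.5 accuracy on any nonempty set."""
    return 0.5 if S else 0.0


estimate = designer_value_mc(6, v_a, v_b)
exact = designer_value_exact(6, v_a, v_b)
print(estimate, exact)
```

In a real experiment each call to `v_a` or `v_b` means retraining a model on the sampled subset, so the number of samples is the main cost driver; truncation heuristics such as those in [9] aim to reduce that cost.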

4 Experiments and Applications of Extended Shapley

With the definition and characterizations of Extended Shapley in place, we highlight several benefits and applications of jointly accounting for the value of the algorithm and the data points.

4.1 Measuring algorithmic performance

In the following experiments, we apply the Extended Shapley framework in several settings. Computing the exact Shapley values requires a number of computations that is exponential in $n$, and therefore we use the TMC-Shapley Monte Carlo approximation algorithm introduced in previous work [9].

How algorithm’s value depends on the data distribution

We begin our experiments by applying Extended Shapley to a simple but illuminating setting where $A$ is a nearest neighbor classifier. We will see that Extended Shapley provides interesting insights into the way the algorithm’s value depends on the data distribution and the algorithm itself.

Figure 1: Extended shapley

(a) The 3NN algorithm is applied to binary classification problems with various levels of difficulty. Using 300 training points, the algorithm achieves the same performance level on all datasets. Meanwhile, it has a smaller Shapley value w.r.t. the Majority Vote benchmark (area between the curves and the grey line) as the dataset increases in difficulty (the number of intervals increases). (b) Different algorithms applied to the same binary classification problem with 8 intervals. Given 300 training points, all of them have the same test performance while 1NN has the highest Shapley value. (c) Two different algorithms (1NN and 3NN) applied to the same binary classification problem (each color represents one of the labels). The Extended Shapley values of the data points tend to be higher near the interval boundaries. In (a) and (b), the shaded areas stand for standard deviation of different runs of the experiment.

For simplicity, the data points are scalars uniformly sampled from and are assigned a label using a binary labeling function. The labeling function divides the interval into several sub-intervals. Points in each sub-interval are assigned one of the two labels, and adjacent sub-intervals are assigned opposite labels. The sub-intervals are randomly chosen such that the resulting labeling function is balanced; i.e. sub-intervals of each label will cover half of (Fig. 1(c)). $A$ is a simple $k$NN ($k$-nearest neighbors) algorithm. The benchmark $B$ is a simple majority-vote classifier that assigns the same label to all of the test data points. The performance of $A$ and $B$ is evaluated on a balanced test set sampled from the same distribution. As the labeling function is balanced, for . A version of this problem corresponds to a choice of a labeling function.

We first show how the value of $A$ under Extended Shapley depends on the data distribution. In Fig. 1(a), we apply 3NN on 300 training points. There are 4 versions of the data distribution, corresponding to 2, 4, 8 and 16 intervals. As the number of intervals increases, the data distribution becomes harder for the 3NN algorithm. The majority vote baseline achieves 0.5 accuracy in all the versions. Each curve in Fig. 1(a) plots the performance $v_A(S)$ averaged across the subsets $S$ of the same cardinality (shown on the x-axis). The Extended Shapley value of $A$, according to Eqn. 3, is exactly the area between the performance curve and the Majority Vote baseline. On the whole dataset $N$, the final performance is the same across all the data versions. Interestingly, the value of $A$ decreases as the number of intervals increases. This reflects the intuition that 3NN extracts less useful information when the neighbors have more alternating labels.
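The area interpretation can be made explicit by regrouping the subset sum by cardinality. The sketch below assumes the designer's value has the form $\varphi_A = \frac{1}{n+1}\sum_{S \subseteq N}\binom{n}{|S|}^{-1}[v_A(S) - v_B(S)]$; the size-averaged curve $\bar v_X(k)$ is notation introduced here for illustration.

```latex
\varphi_A
  \;=\; \frac{1}{n+1}\sum_{k=0}^{n}\Bigl[\bar v_A(k)-\bar v_B(k)\Bigr],
\qquad
\bar v_X(k) \;=\; \binom{n}{k}^{-1}\sum_{S\subseteq N,\;|S|=k} v_X(S),
\quad X \in \{A, B\}.
```

Read this way, $\varphi_A$ is the mean vertical gap between the two size-averaged performance curves, i.e. the area between them normalized by the number of subset sizes. Two algorithms that reach the same accuracy on the full dataset can therefore receive different values if one of them learns faster from small subsets.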

Next we consider the complementary setting where we fix a version of the data distribution with 8 intervals, and apply several $k$NNs for (Fig. 1(b)). Here the different $k$NNs all have the same final performance on the full data $N$. However, the value of $A$ decreases as $k$ increases. This is interesting because, when the data has 8 intervals, using more neighbors is noisier, especially for smaller training sizes.

Finally we investigate the Extended Shapley value of individual data points. We fix a particular version with 6 intervals and apply 1NN and 3NN in this setting (Fig. 1(c)). For each given data point, we create a data set of size by sampling data points from the same distribution and compute its value in that data set. We repeat this process 100 times and take the average value of that point. The individual data values are plotted in the bottom two panels of Fig. 1(c). The Extended Shapley data values are higher closer to the interval boundaries, as those points are informative. The data values for 3NN are overall slightly higher than for 1NN. The 3NN values are noisier since points in one interval (especially near the boundary) can lead to mistakes in the adjacent interval when chosen as neighbors.

Figure 2: Shapley value of disease prediction algorithms

For a training data set of size 1000: (a) The algorithm that has a better test performance (random forest) is also the one with a higher value. (b) Both algorithms have the same performance while having different Shapley values. (c) The algorithm with the better performance (logistic regression) is the one with a smaller Shapley value.

Performance versus Extended Shapley value

As a real-world example, we compare the performance and the Extended Shapley value of two different algorithms for the problem of disease prediction. The task is to predict, given an individual’s phenotypic data, whether they will be diagnosed with a certain disease in the future. We worked on three problems: predicting malignant neoplasms of the breast, lung and ovary (ICD10 codes C50, C34, and C56) from the UK Biobank data set [25]. For each disease we create a balanced training data set of 1000 individuals, half of whom are diagnosed with the disease. We then compare the value of two algorithms (logistic regression and random forest) against the majority-vote benchmark $B$.

The three diseases illustrate three interesting scenarios of how Extended Shapley gives insights different from simply comparing the overall model performances (Fig. 2). For breast cancer prediction, the random forest algorithm has a higher performance ( vs ) and a higher Extended Shapley value ( vs ). For lung cancer prediction, both algorithms have almost the same performance ( and ) while random forest has a higher value ( vs ) because it learns more from smaller subsets of the data. Ovarian cancer is a setting where the logistic regression model, while having a better performance ( vs ), has a lower Extended Shapley value ( vs ). This suggests that even though logistic regression achieves good final accuracy, the training data itself deserves some of the credit, since the model performs poorly on smaller subsets. A balanced test set of size 500 was used to compute the value and performance in all three problems.

Fair algorithm gets credit for reducing disparity

It has been demonstrated that ML models perform poorly on dark-skinned women when detecting gender from face images [2]. One way to compensate for this problem is to increase the weight of the images in the minority subgroups in the cost function of the learning algorithm. Following work in the literature [13], we use a set of 1000 images from the LFW+A [27] data set that has an imbalanced representation of different subgroups ( female, black) as our training data. The performance is measured using the maximum disparity (difference in prediction accuracy between subgroups) on 800 images of the PPB [2] data set, which is designed to have equal representation of sex and different skin colors. All images are transformed into 128-dimensional features by passing them through an Inception-Resnet-V1 [26] architecture pre-trained on the Celeb A [15] data set (more than 200,000 face images). The benchmark $B$ is a logistic regression algorithm and $A$ is a weighted logistic regression algorithm where the weight for samples of each subgroup is inversely proportional to the subgroup size. $A$ achieves a lower disparity of while $B$ results in a disparity ($A$ also has a higher accuracy of vs ). The Shapley value of $A$ against the benchmark is , almost equal to the difference in the disparity (). This means that most of the credit for the less biased performance goes to the design of the fairness-aware algorithm $A$.

4.2 Algorithmic Accountability

Figure 3: Left: (source) and (target); represent labels and numbers represent each colors’ fractional mass. Note that and differ only in that the labels of the green points are flipped. Algorithms are trained on , but are evaluated on . This causes the non-linear classifier to under-perform relative to the benchmark linear classifier : . Right: A comparison between evaluating the algorithm’s value using Extended Shapley (blue) versus the marginal difference (red), where the fractional mass of the green points varies between and . At both lines start at since without any mis-labeled examples, the linear and non-linear algorithms perform equally well. Note that despite the similar shape, under Extended Shapley the algorithm’s value is much less negative, suggesting that the issues with the data are largely accountable for the performance gap.

Consider the following motivating scenario: a company develops a state-of-the-art classification algorithm that is intended to identify individuals who require critical medical attention. Team 1 is in charge of collecting and preparing appropriate datasets to feed the ML model; Team 2 uses this data, as well as their combination of technical and domain expertise, to train the models. Some time after the product’s deployment, an independent investigation reveals that there is a subpopulation of individuals for which the deployed model provides near-meaningless predictions. The company wants to understand why the product failed in this way, and who is responsible. While this is clearly a toy example, it captures several characteristics and challenges of ML accountability. The conditions under which algorithms are eventually deployed typically differ from those in which they were developed (e.g. domain adaptation). And when things do go wrong, who is to blame? This is not merely finger-pointing: it is a crucial part of the model development and “debugging” process. In this section, we take a technical (and inherently narrow) perspective on this broad question of accountability. In particular, we explore the way Extended Shapley, by explicitly quantifying the value of both the ML designer and the data, may provide a means of disentangling the effect of the algorithm choice (Team 2) from the effect of the training data (Team 1).

We instantiate the motivating scenario as follows. Suppose we have a source distribution and a target distribution. The source distribution is the distribution from which the training dataset is sampled; this dataset will be used to train all subsequent algorithms. A second dataset is sampled from the target distribution and used to evaluate deployment performance; i.e., the performance metric calculates some measure of accuracy on this second dataset. We assume that the source and target distributions differ in that there is one sub-population whose labels are incorrectly flipped in the source distribution; see Figure 3 for a visualization. As our benchmark $B$ we take the class of linear classifiers (hyperplanes). The ML designer, on the other hand, chooses to work with a slightly more complex class: $A$ is the intersection of (up to) two hyperplanes. To simplify the problem, we ignore issues of optimization and sample complexity and assume that $A$ and $B$ can solve their respective risk minimization problems exactly. We consider the classifiers obtained by applying $A$ (resp., $B$) on the training dataset.

As Figure 3 shows, the mislabeling of the subpopulation in the source distribution (left) means that $A$ actually under-performs relative to $B$ on the target distribution (right). A naive approach for attributing “blame” might consider the marginal difference in performance of algorithm $A$ relative to the benchmark $B$: in this case, . One way to interpret this is that the ML designer’s choice of an “incorrect” hypothesis class (non-linear classifiers) has a cost of . However, there is another source for the gap in performance between $A$ and $B$: the difference in the source and target distributions (the mis-labeled training points). Intuitively, if it were not for the presence of these erroneously labeled data points, $A$ would not have performed so poorly, and may have even performed better. Therefore, it’s no longer clear that the marginal difference is entirely $A$’s “fault”. The right panel of Figure 3 demonstrates that Extended Shapley indeed takes this into account, assigning a value that’s much less negative than the marginal difference. See Appendix B for a detailed description of the setup of this experiment.

4.3 Robust Data Valuation

The regular data Shapley (Eqn. 1) is highly vulnerable to attacks by the ML designer. We show that the ML designer can manipulate the learning algorithm such that the overall performance is unchanged but the data Shapley value is substantially altered.

Suppose that the ML designer has some “favourite” point, which we denote as $i^*$. We now show that for every algorithm $A$ the ML designer wishes to use, they can achieve identical performance while guaranteeing that Data Shapley will allocate the payoffs strictly to $i^*$. Indeed, define the following variant $A'$ of $A$:

$$A'(S) = \begin{cases} A(N) & \text{if } i^* \in S \\ h_0 & \text{otherwise} \end{cases}$$

where $h_0$ is some fixed predictor, say .

Note that $A'(N) = A(N)$, so the overall performance remains identical. However, under $A'$, the marginal contribution of any point $i \neq i^*$ to any subset is zero. This implies that $\varphi_i = 0$ for all $i \neq i^*$. From efficiency, this also implies that $\varphi_{i^*} = v_A(N)$. In other words, the ML designer’s favourite point is now the sole receiver of the value. This example illustrates how the Data Shapley value is vulnerable to adversarial manipulations.
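The manipulation can be checked numerically on a toy problem. The sketch below uses one construction consistent with the description above, stated here as an assumption: the manipulated algorithm returns the full-data predictor whenever the favourite point is present and a fixed predictor otherwise, and the fixed predictor is assumed to score 0. The names `data_shapley`, `v_honest` and `v_manipulated`, and the linear toy performance function, are hypothetical.

```python
from itertools import combinations
from math import factorial


def data_shapley(n, v):
    """Exact Data Shapley values for points 0..n-1 (brute-force enumeration)."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                S = frozenset(subset)
                phi += w * (v(S | {i}) - v(S))
        values.append(phi)
    return values


N_POINTS = 4
FAVOURITE = 2  # index of the designer's favourite point


def v_honest(S):
    """Hypothetical honest algorithm: every point adds 0.2 accuracy."""
    return 0.2 * len(S)


def v_manipulated(S):
    """Assumed manipulation: full-data performance iff the favourite is present.

    The fixed fallback predictor is assumed to have performance 0.
    """
    full = frozenset(range(N_POINTS))
    return v_honest(full) if FAVOURITE in S else 0.0


honest_values = data_shapley(N_POINTS, v_honest)
manipulated_values = data_shapley(N_POINTS, v_manipulated)
print(honest_values)
print(manipulated_values)
```

Both algorithms have the same performance on the full dataset, yet the honest one spreads the value evenly while the manipulated one assigns all of it to the favourite point.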

Intuitively, the construction in Example 4.3 exploits the fact that Data Shapley assumes the algorithm it evaluates is given as a black box [9]. This might suggest that a way of guaranteeing robustness to manipulation would be to require the ML designer to commit to a particular structural form (e.g., applying a certain pre-processing procedure on the data and then fitting logistic regression), and to disclose it to the party conducting the data valuation. The requirement for full disclosure of the model could be a hurdle to applying this approach in practical use cases. But as we demonstrate in the next example, even that does not rule out every form of manipulation from the ML designer’s side.

Assume now that there is a subset of points belonging to a minority subpopulation, whose value the ML designer wishes to down-weight. The designer can define (and disclose) $A'$ as follows: before applying $A$, remove from the training set points from subpopulations that have fewer than some threshold number of examples. This form of pre-processing could be justified as improving generalization performance (e.g. as a form of outlier removal), yet it significantly hurts the value of points in the minority subset: under $A'$, each such point only has a marginal contribution for subsets that already include the rest of the minority subset.

We claim that Extended Shapley provides a certain robustness to manipulations without knowing anything about algorithm $A$ except that its performance doesn’t decrease when given a larger training set (which is often the case). In particular, an immediate corollary of Proposition 3 implies the existence of a lower bound on the payoff assigned to each $i \in N$ that’s independent of the ML designer.

The Shapley value of every satisfies:

This highlights the benefit of measuring the performance of $A$ relative to a fixed benchmark $B$. In practical applications, this benchmark can be chosen by the auditor conducting the data valuation; we therefore think of it as non-adversarial. This demonstrates that for an appropriate benchmark model $B$, Extended Shapley can more robustly account for the usefulness of data than the regular data Shapley value.

5 Conclusions

Extended Shapley provides a principled framework to jointly quantify the contribution of the learning algorithm and training data to the overall model’s performance. It’s a step in formalizing the notion of ML accountability, which is increasingly important as ML becomes more widespread in mission-critical applications. The strong axiomatic foundation of Extended Shapley guarantees equitable treatment of data and algorithm. We have demonstrated that Extended Shapley can be used in several important applications, such as measuring progress in ML and assigning responsibility for failure cases. In these applications we focused on using Extended Shapley as a diagnostic tool, and it provides interesting insights into how an algorithm’s value depends on the training dataset. A natural and interesting direction for future work is to investigate the extent to which the insights that the Shapley value provides can be used to improve performance at test time. Another important direction for future work is to generalize Extended Shapley to multiple algorithms and multiple benchmarks. This could be useful in practice (when it’s not clear which single algorithm should serve as the baseline) and could also provide a way of strengthening the guarantees against manipulation for the individual data points.


  • [1] A. Agarwal, M. Dahleh, and T. Sarkar (2018) A marketplace for data: an algorithmic solution. arXiv preprint arXiv:1805.08125. Cited by: §1.
  • [2] J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91. Cited by: §4.1.
  • [3] J. Castro, D. Gómez, and J. Tejada (2009) Polynomial calculation of the shapley value based on sampling. Computers & Operations Research 36 (5), pp. 1726–1730. Cited by: §1.
  • [4] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan (2018) L-shapley and c-shapley: efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610. Cited by: §1.
  • [5] S. Cohen, G. Dror, and E. Ruppin (2007) Feature selection via coalitional game theory. Neural Computation 19 (7), pp. 1939–1961. Cited by: §1.
  • [6] A. Datta, S. Sen, and Y. Zick (2016) Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pp. 598–617. Cited by: §1.
  • [7] S. S. Fatima, M. Wooldridge, and N. R. Jennings (2008) A linear approximation method for the shapley value. Artificial Intelligence 172 (14), pp. 1673–1699. Cited by: §1.
  • [8] A. Ghorbani, A. Abid, and J. Zou (2019) Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3681–3688. Cited by: §1.
  • [9] A. Ghorbani and J. Zou (2019) Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §1, §2.1, §2.1, §2.2, §3, §4.1, §4.3.
  • [10] F. Gul (1989) Bargaining foundations of shapley value. Econometrica: Journal of the Econometric Society, pp. 81–95. Cited by: §1.
  • [11] H. Hamers, B. Husslage, R. Lindelauf, T. Campen, et al. (2016) A new approximation method for the shapley value applied to the wtc 9/11 terrorist attack. Technical report. Cited by: §1.
  • [12] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gurel, B. Li, C. Zhang, D. Song, and C. Spanos (2019) Towards efficient data valuation based on the shapley value. arXiv preprint arXiv:1902.10275. Cited by: §1.
  • [13] M. P. Kim, A. Ghorbani, and J. Zou (2019) Multiaccuracy: black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247–254. Cited by: §4.1.
  • [14] I. Kononenko et al. (2010) An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11 (Jan), pp. 1–18. Cited by: §1.
  • [15] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738. Cited by: §4.1.
  • [16] S. M. Lundberg, G. G. Erion, and S. Lee (2018) Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888. Cited by: §1.
  • [17] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §1.
  • [18] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers (2013) Bounding the estimation error of sampling-based shapley value approximation. arXiv preprint arXiv:1306.4265. Cited by: §1.
  • [19] T. P. Michalak, K. V. Aadithya, P. L. Szczepanski, B. Ravindran, and N. R. Jennings (2013) Efficient computation of the shapley value for game-theoretic network centrality. Journal of Artificial Intelligence Research 46, pp. 607–650. Cited by: §1.
  • [20] J. W. Milnor and L. S. Shapley (1978) Values of large games ii: oceanic games. Mathematics of operations research 3 (4), pp. 290–307. Cited by: §1.
  • [21] A. B. Owen (2014) Sobol’ indices and shapley value. SIAM/ASA Journal on Uncertainty Quantification 2 (1), pp. 245–251. Cited by: §1.
  • [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §2.2.
  • [23] L. S. Shapley, A. E. Roth, et al. (1988) The shapley value: essays in honor of lloyd s. shapley. Cambridge University Press. Cited by: §2.1.
  • [24] L. S. Shapley (1953) A value for n-person games. Contributions to the Theory of Games 2 (28), pp. 307–317. Cited by: §1, §2.1.
  • [25] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, et al. (2015) UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine 12 (3), pp. e1001779. Cited by: §4.1.
  • [26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, Vol. 4, pp. 12. Cited by: §4.1.
  • [27] L. Wolf, T. Hassner, and Y. Taigman (2011) Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE transactions on pattern analysis and machine intelligence 33 (10), pp. 1978–1990. Cited by: §4.1.
  • [28] J. Zou and L. Schiebinger (2018) AI can be sexist and racist—it’s time to make it fair. Nature Publishing Group. Cited by: §1.

Appendix A Proofs

A.1 Proof of Proposition 3

The proof of the proposition will consist of two parts: first we define a valuation scheme and prove that it satisfies P1-P5; then we will show that this valuation scheme takes the form in the proposition statement.

Define as follows:

where is the regular Data Shapley and is defined as


Recall that satisfies S1-S4. We will now use this to prove that satisfies properties P1-P5.

P1. First, assume is a null player in and . We claim that this implies is a null player in the game . To show this, we must prove that for every , . Indeed: if the subset includes , the requirement is equivalent to ; if it doesn’t, the requirement is equivalent to . In either case the statement is true since was assumed to be a null player in both and . Thus from S1 we have that . Finally, consider . Note that for any , the requirement that is identical, by the definition of , to . Thus the assumption of P1 means is a null player in the game . From S1 we therefore have that , as required.

P2. Suppose are identical in both and ; we will show that this means they are identical in the game . Consider subsets of . Suppose a subset includes , then is equivalent to and the latter is true since are identical in . Similarly, if the subset doesn’t include , this is equivalent to , which is true since are identical in . Thus from S2, we have that .

P3. We will prove that is linear in its first component; an identical argument can be used to show linearity in the second. Let be any three games in . Note that satisfies: . Thus from S3, . We therefore have that , as required.


where the second transition is from S4 and the third is directly by the definition of .

P5. Note that by the definition of , is equivalent to , for every . Then the assumption in this property implies that and are identical; from S2, we have that , as required.
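The properties argued above can also be checked numerically on a small instance. The sketch below is illustrative only: it assumes the extended game evaluates a coalition with the first algorithm's value function when the algorithm player is present and with the baseline's value function otherwise, as the case analysis in P1 and P2 suggests. The names `ALG`, `shapley`, `extended_game`, `v_a` and `v_b`, and the toy value functions themselves, are hypothetical and not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

ALG = "A"  # hypothetical label for the extra "algorithm" player


def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = Fraction(0)
        for k in range(n):
            weight = Fraction(factorial(k) * factorial(n - 1 - k), factorial(n))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def extended_game(v_a, v_b):
    """Assumed extended game: v_a if the algorithm player is present, else v_b."""
    def value(coalition):
        if ALG in coalition:
            return v_a(coalition - {ALG})
        return v_b(coalition)
    return value


# Toy value functions (hypothetical): point 2 is a null player under both,
# and points 0 and 1 are interchangeable under both.
def v_a(s):
    return Fraction(3 * len(s & {0, 1}) ** 2, 10)


def v_b(s):
    return Fraction(len(s & {0, 1}), 10)


phi = shapley([0, 1, 2, ALG], extended_game(v_a, v_b))
print(phi)
```

On this toy instance the null point receives zero value, the two interchangeable points receive equal value, and all values sum to the first algorithm's performance on the full dataset, matching the null-player, symmetry and efficiency properties.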

We now turn to prove that the value assigned to the algorithm and to the datapoints follow the expressions in Equations 3 and 4. For the value of the algorithm, this follows directly by applying the definition of the Shapley value in the player game:

We now turn to proving Equation 4. First, to simplify notation, we will use to denote the marginal difference of evaluated on versus :

By the definition of Extended Shapley,

We now split this sum into two terms, those subsets that include and those that don’t:

and consider each term separately. For the term on the right, note that iterating over subsets of that don’t include is equivalent to iterating over subsets of , and that in this case is evaluated as ; we can therefore re-write the term on the right as:


For the term on the left:

Together, we have that can be written as

The proof can be concluded by observing that and .
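The splitting argument in this proof can be sanity-checked numerically. The sketch below assumes the same extended game as above (the first algorithm's value function when the algorithm player is in the coalition, the baseline's otherwise) and compares two computations of a data point's value: directly from the Shapley definition in the (N+1)-player game, and as a weighted mix of the point's marginal gains under each algorithm. The weights (k+1)/(N+1) and (N-k)/(N+1) follow from the ratio of the Shapley coefficients in the two games; whether this matches the exact form of Equation 4 is an assumption, since the equation is not reproduced here. All function names and toy value functions are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of items as a frozenset."""
    for k in range(len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)


def extended_value_direct(i, data, v_a, v_b):
    """Shapley value of point i in the (N+1)-player game, from the definition."""
    alg = "alg"  # hypothetical label for the algorithm player
    others = [p for p in data if p != i] + [alg]
    n = len(data) + 1

    def v(t):
        return v_a(t - {alg}) if alg in t else v_b(t)

    total = Fraction(0)
    for t in subsets(others):
        w = Fraction(factorial(len(t)) * factorial(n - 1 - len(t)), factorial(n))
        total += w * (v(t | {i}) - v(t))
    return total


def extended_value_split(i, data, v_a, v_b):
    """Same value as a weighted mix of marginal gains under v_a and v_b."""
    n = len(data)
    rest = [p for p in data if p != i]
    total = Fraction(0)
    for s in subsets(rest):
        k = len(s)
        # Standard Data Shapley coefficient for a subset of size k.
        w = Fraction(factorial(k) * factorial(n - 1 - k), factorial(n))
        gain_a = v_a(s | {i}) - v_a(s)
        gain_b = v_b(s | {i}) - v_b(s)
        total += w * (Fraction(k + 1, n + 1) * gain_a + Fraction(n - k, n + 1) * gain_b)
    return total


# Toy value functions (hypothetical, chosen only to be non-trivial).
DATA = [0, 1, 2, 3]


def v_a(s):
    return Fraction(len(s) ** 2, 16)


def v_b(s):
    return Fraction(sum(s), 10)


checks = [
    extended_value_direct(i, DATA, v_a, v_b) == extended_value_split(i, DATA, v_a, v_b)
    for i in DATA
]
print(checks)
```

Exact rational arithmetic (`Fraction`) is used so that the two computations can be compared for equality rather than up to floating-point tolerance.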

A.2 Proof of Lemma 3

First, we note that by combining the efficiency axiom with the characterization of the previous proposition, we can also write the ML designer’s value as

To see this, first note that from the efficiency property of Extended Shapley (P4), . Additionally, from the efficiency property of the standard Data Shapley (S4) w.r.t. , . Thus, . By the definition of the Shapley value (Equation 1), the term on the left is . Now, by substituting with the expression for the value of the datapoints (Equation 4) and re-arranging, we get exactly the above.

Next, we note that in general the sum is equivalent to the sum . We can therefore write:

Combining these two facts, we have:

The proof can be concluded by noting that .
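The efficiency-based view of the ML designer's value used in this proof can be illustrated on a toy instance. The sketch below assumes, as a reading of the argument above, that the algorithm's value equals the performance gap on the full dataset minus the total amount by which the data points' Extended Shapley values exceed their standard Data Shapley values under the baseline. This follows from combining the two efficiency statements (P4 and S4); the names `shapley`, `v_ext`, `v_a`, `v_b`, and the toy value functions are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        phi[p] = sum(
            Fraction(factorial(k) * factorial(n - 1 - k), factorial(n))
            * (value(frozenset(c) | {p}) - value(frozenset(c)))
            for k in range(n)
            for c in combinations(others, k)
        )
    return phi


DATA = [0, 1, 2]
ALG = "alg"  # hypothetical label for the algorithm player


# Toy value functions (hypothetical); both are zero on the empty set.
def v_a(s):
    return Fraction(len(s) ** 2, 9)


def v_b(s):
    return Fraction(len(s), 6)


def v_ext(t):
    """Assumed extended game: v_a with the algorithm player, v_b without."""
    return v_a(t - {ALG}) if ALG in t else v_b(t)


ext = shapley(DATA + [ALG], v_ext)   # Extended Shapley values
base = shapley(DATA, v_b)            # standard Data Shapley under the baseline
full = frozenset(DATA)

lhs = ext[ALG]
rhs = (v_a(full) - v_b(full)) - sum(ext[i] - base[i] for i in DATA)
print(lhs, rhs)
```

The two quantities agree, which reflects only the two efficiency statements and does not depend on the particular toy value functions chosen here.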

A.3 Proof of Lemma 3

The claim follows by proving that if is -stable in the sense defined in the statement of the lemma, then

To prove this, we employ the second view of the ML designer’s value from the previous lemma; in particular, it guarantees that

which we can simplify as follows:

Together, these establish the claim.

Appendix B Details of the experiments in Section 4.2

Figure 4: A problem instance is defined by , the fractional mass of the green points. They are mislabeled: in the source distribution their label is , yet in the target distribution it is . The other two groups are identical between the source and target distributions, and their fractional masses are given by .

Every point in the graph on the right-hand side of Figure 3 corresponds to a problem instance as illustrated in Figure 4. We now describe how we compute the value of the algorithm for each problem instance. It is sufficient to define and . For simplicity, we assume that the classifier obtained by training algorithms and on a subset is only a function of the colors of the points in ; see Table 1 for a full specification of the assumed behaviours of and .

y, b, g
y, b
b, g
y, g
Table 1: Bottom to top: when there is only a single color or when consists of only yellow and green points, then the entire training set has the same label and we assume the output is some dummy classifier . When consists of blue and green points, both models output the linear classifier (yellow line). When consists of yellow and blue points, both models output the linear classifier . Finally, when consists of all three colors, the linear model outputs whereas outputs the non-linear classifier .

The performance of the classifiers above depends on the problem instance, which is defined by , the fractional masses of the yellow, blue and green subgroups. The following table summarizes the resulting performance as a function of these probabilities:

y, b, g
y, b
b, g
y, g 0 0
y 0 0
g 0 0
b 0 0
Table 2: The performance of is , since on it correctly classifies the blue points and half of the yellow points. The performance of is , since it only errs on the green points. Finally, the performance of is 1.0 since it correctly classifies all the points.