1 Introduction
Large-scale machine learning and data mining depend on data contributed by various individuals and institutions. With the popularity of such data-driven systems, there is increasing awareness of the value of data and the associated privacy risks. As data sharing and data marketplaces become common, it is necessary to accurately evaluate a contributor's data in order to provide them with the right compensation. On the other hand, identifying high-value data is also of advantage to other stakeholders such as data users. For these purposes, the Shapley value has been proposed as a fair method of determining the value of each data point Jia et al. (2019a); Liu et al. (2021).
The Shapley value is a concept from game theory
Shapley (1953), defined to evaluate the contribution of individual players in a cooperative game. The concept is very general and can be applied to complex setups where multiple elements interact to produce results. Thus, it has been applied to understanding different elements in machine learning and is a popular tool in interpretable machine learning Lundberg and Lee (2017). It is used to determine the importance of individual features Cohen et al. (2007); Tripathi et al. (2020), neurons
Ghorbani and Zou (2020), models in ensembles Rozemberczki and Sarkar (2021), data points Ghorbani and Zou (2019); Jia et al. (2019a) and many others (see Rozemberczki et al. (2022)). However, the use of Shapley values is challenging from the perspectives of both computation and privacy. The computation of Shapley values requires evaluating the marginal contribution of a player (for us, a data point) with respect to all possible coalitions (subsets) of players (see Section 2). This computational problem is $\#P$-complete Deng and Papadimitriou (1994). Additionally, each such evaluation involves training and evaluating a model. Approximations based on Monte Carlo sampling of coalitions Maleki et al. (2013) reduce the number of evaluations required, but the cost remains prohibitive in large datasets. The reduction of computation (number of evaluations) has been the focus of several recent works on the Shapley value of data Ghorbani and Zou (2019); Jia et al. (2019a); Kwon et al. (2021).
From the privacy perspective, the Shapley value poses a complex challenge, since the contribution of a data point is influenced by the configuration of all other data in the set. Our objective is to answer a query for $\phi(i)$, the Shapley value of data point $i$, which will require access to the entire dataset. Even Monte Carlo methods that sample random coalitions require the use of almost all data points (Section 3). Conversely, answering a query for $\phi(i)$ can leak the privacy of any other data point $j$. To address this challenge, we develop an algorithm specifically for data evaluation that heavily samples smaller coalitions. This algorithm is more compatible with differential privacy and needs to access only a small fraction of the data. It is thus more privacy-friendly than existing methods.
Our contributions. Our approach is based on diminishing marginal gains with increasing data volume. In machine learning problems, larger training datasets are desirable, but the incremental benefit per data point decreases with increasing data size. This effect has been seen in past experiments (e.g. Sun et al. (2017), Fig. 4). We show theoretically that, for an empirical risk minimization (ERM) problem, the marginal reduction in loss per data point is inversely proportional to the data size, i.e. $O(1/n)$ (Subsection 3.1). Thus, each new data point contributes less to the objective of loss minimization in larger datasets. Similar theoretical bounds on marginal differences hold for uniformly stable algorithms such as regularized ERM.
Using the property that the marginal utility of a data point is bounded by $O(1/k)$ for coalitions of size $k$, we devise a Shapley value computation algorithm in Subsection 3.2. This algorithm stratifies the coalitions into layers by size and samples the layers with varying probability. We call this the Layered Shapley Algorithm. The algorithm heavily samples the lower layers with small coalitions, and sparsely samples the higher layers with large coalitions. The intuition is that, given the diminishing returns property, small coalitions provide sufficient information on a data point's utility. Not much remains to be gained by examining large coalitions, where the marginal utility is anyway guaranteed to be small. The algorithm evaluates only a small number of coalition samples and uses only a fraction of the dataset; it is thus highly efficient in both the number of evaluations and data usage. In Subsection 3.3 we discuss differentially private Shapley value computation based on the Layered Shapley Algorithm. We use the strong sampling property of the algorithm together with sampling-based privacy amplification results to obtain differential privacy at the price of relatively small noise.
Some properties of the Layered algorithm and related results are discussed in Subsection 3.4. We observe that the bias towards evaluating smaller coalitions significantly reduces computational costs, since small coalitions are cheaper to train on. The data points and coalitions needed to compute a value $\hat{\phi}(i)$ can be saved and used to answer further Shapley value queries on the same dataset. Thus, the Layered algorithm produces a natural small core set for querying Shapley values. Finally, we argue that in the realistic case where a contributor submits a set of points, the aggregate evaluation of the set can be carried out at a small relative cost.
Experimental results are discussed in Section 4, where we demonstrate that Shapley values calculated via the Layered Shapley Algorithm, and their differentially private counterparts, successfully describe the relative utility of data points in multiple binary classification tasks despite relying on small coalition sizes. The private algorithm approximately preserves the relative ranks of data points compared to the non-private version. Related works are discussed in Section 5. Proofs of theorems can be found in the appendix. In the next section, we review the background on the Shapley value and differential privacy. Readers familiar with these topics may want to quickly skim the section to note the definitions and notation.
2 Preliminaries
2.1 The Shapley value
The Shapley value was originally developed to evaluate the contributions of different players in a cooperative game Shapley (1953). In machine learning, the players are usually elements of the input to the training algorithm. For example, the input features can be treated as players to estimate their relative importance. Analogously, to evaluate the relative importance of different data points, they are treated as players, and the Shapley value of a data point represents its importance in the training process.
In a game with $n$ players (which may be features or data points as the case may be), the Shapley value of player $i$, written as $\phi(i)$, is defined in terms of $i$'s marginal contributions to coalitions of other players. Suppose $N$ is the set of players, and $2^N$ is the set of all possible subsets (coalitions) of players. The utility obtained by any coalition $S \subseteq N$ is given by a value function $v: 2^N \to \mathbb{R}$, and the marginal contribution of player $i$ with respect to $S$ is written as $\Delta_i(S) = v(S \cup \{i\}) - v(S)$. The Shapley value is then defined by:
$$\phi(i) = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{\substack{S \subseteq N \setminus \{i\} \\ |S| = k}} \big( v(S \cup \{i\}) - v(S) \big) \qquad (1)$$
Observe that this definition is equivalent to computing the average marginal gain of $i$ over coalitions of each possible size, and then averaging over all possible sizes.
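To make the definition concrete, the following sketch evaluates Eq. (1) exactly for a toy three-player game (the utility $v(S) = |S|^2$ is an arbitrary illustrative choice, not from the paper):

```python
from itertools import combinations
from math import comb

def shapley_value(players, v, i):
    """Exact Shapley value of player i under value function v (Eq. 1):
    average the marginal gain of i within each coalition-size layer,
    then average over all sizes."""
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for k in range(n):  # coalition sizes 0 .. n-1
        layer = 0.0
        for S in combinations(others, k):
            layer += v(set(S) | {i}) - v(set(S))
        total += layer / comb(n - 1, k)  # average within the size-k layer
    return total / n                     # average over sizes

# Toy 3-player game with a superadditive utility.
players = [0, 1, 2]
v = lambda S: len(S) ** 2
vals = [shapley_value(players, v, i) for i in players]
print(vals)       # → [3.0, 3.0, 3.0]: symmetric players get equal value
print(sum(vals))  # efficiency: the values sum to v(N) - v({}) = 9
```

Note how the symmetry and efficiency properties mentioned below are visible even in this tiny example.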
The appeal of the Shapley value is that it provides a fair allocation of credit, more meaningful than simple marginal contributions. This fairness is characterized by several intuitive properties, such as efficiency, symmetry, null player, and linearity; the Shapley value is the unique valuation function that satisfies all of them. See the survey Rozemberczki et al. (2022) for details of these properties.
In the context of machine learning, $v$ is often defined in terms of the loss function, measuring how much an element helps in minimizing loss. In the typical data evaluation problem Ghorbani and Zou (2019); Ghorbani et al. (2020); Jia et al. (2019b), each data point is treated as a player, and $h_S$ is the model trained on coalition $S$. If $L(h_S)$ is the loss of $h_S$, then $v(S)$ can be defined as $v(S) = -L(h_S)$. Thus, the Shapley value is larger for data points that help more in minimizing the loss. Both the training or empirical loss Tripathi et al. (2020) and the validation loss Jia et al. (2019b) have been used to define $v$ in machine learning research. In the feature evaluation problem Guyon and Elisseeff (2003); Fryer et al. (2021), each feature is treated as a player, and for a subset $S$ of features, $v(S)$ is defined analogously in terms of the loss.
2.1.1 Approximate computation of Shapley value
The definition of the Shapley value requires computing $v(S)$ for an exponential number of coalitions, making it computationally expensive. The typical approach to tractable computation is a Monte Carlo estimate over the set of coalitions. Suppose $\pi$ is a permutation of $N$, taken uniformly at random, and $P_i^\pi$ is the set of players occurring before $i$ in $\pi$. Then the basic sampling-based algorithm Castro et al. (2009) computes the average marginal gain over a sample $\Pi$ of $m$ such permutations to obtain the approximate Shapley value: $\hat{\phi}(i) = \frac{1}{m} \sum_{\pi \in \Pi} \Delta_i(P_i^\pi)$. When all $|\Delta_i(S)|$ are bounded by a constant $r$, a sample of size $m = O\!\left(\frac{r^2}{\epsilon^2} \log \frac{1}{\delta}\right)$ achieves an $(\epsilon, \delta)$-approximation guarantee Maleki et al. (2013): $\Pr[|\hat{\phi}(i) - \phi(i)| \geq \epsilon] \leq \delta$.
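A minimal sketch of this permutation-sampling estimator (the toy utility is again an arbitrary illustration; for it, the exact value of every player is $4.0$ by symmetry):

```python
import random

def mc_shapley(players, v, i, m, seed=0):
    """Castro et al.-style Monte Carlo estimate: average the marginal gain
    of i over m uniformly random permutations."""
    rng = random.Random(seed)
    est = 0.0
    for _ in range(m):
        pi = players[:]
        rng.shuffle(pi)
        pred = set(pi[:pi.index(i)])   # players preceding i in the permutation
        est += v(pred | {i}) - v(pred)
    return est / m

players = [0, 1, 2, 3]
v = lambda S: len(S) ** 2              # toy superadditive utility
print(mc_shapley(players, v, 0, m=2000))  # ≈ 4.0, the exact value by symmetry
```

Each permutation implicitly draws a coalition size uniformly, which is why the estimator is unbiased for Eq. (1).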
2.2 Differential Privacy
The privacy of data points is at risk even when computing a seemingly complex aggregate value such as a machine learning model Shokri et al. (2017) or, in our case, a Shapley value. The computation of a Shapley value uses every other data point and thus risks their privacy. Differential privacy Dwork (2006) is designed to defend against such privacy leaks. It provides a statistical privacy guarantee for all data points by ensuring that the released value is statistically insensitive to the presence or absence of any individual data point.
Definition 2.1 (Neighbouring Databases).
Two databases $D, D'$ are neighbouring if $d_H(D, D') \leq 1$, where $d_H$ represents the Hamming distance.
Definition 2.2 (Differential Privacy Dwork (2006)).
A randomized algorithm $\mathcal{M}$ satisfies $\epsilon$-differential privacy if for all neighbouring databases $D$ and $D'$ and for all possible outputs $O \subseteq \mathrm{Range}(\mathcal{M})$, $\Pr[\mathcal{M}(D) \in O] \leq e^{\epsilon} \Pr[\mathcal{M}(D') \in O]$.
The sensitivity of a function $f$ is defined as the maximum change in the function value between neighbouring databases: $\Delta f = \max_{D, D' \text{ neighbouring}} |f(D) - f(D')|$. The sensitivity determines the appropriate scale of noise to add to $f(D)$ to achieve differential privacy, as follows:
Theorem 2.3 (Laplace Mechanism).
Given a function $f$ with sensitivity $\Delta f$, the Laplace Mechanism releasing $f(D) + \mathrm{Lap}(\Delta f / \epsilon)$ satisfies $\epsilon$-differential privacy.
As we will discuss, one approach to releasing a privacy-preserving Shapley value is to determine its sensitivity and add the appropriate amount of noise. The challenge will be to do this while maintaining accurate estimates of the value.
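As a concrete illustration of Theorem 2.3, here is a minimal sketch of the Laplace mechanism applied to a count query (the inverse-transform sampler is our own, since Python's random module provides no Laplace draw):

```python
import math
import random

def laplace_mechanism(value, sensitivity, eps, rng=None):
    """Standard eps-DP Laplace mechanism: release value + Lap(sensitivity/eps).
    The Laplace draw uses inverse-transform sampling from a uniform."""
    rng = rng or random.Random(0)
    b = sensitivity / eps              # noise scale = Delta f / eps
    u = rng.random() - 0.5             # uniform on (-0.5, 0.5)
    return value - b * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# A count query has sensitivity 1: one person changes the count by at most 1.
true_count = 42
print(laplace_mechanism(true_count, sensitivity=1.0, eps=1.0))
```

The noise is zero-mean, so repeated releases average around the true value while each single release hides any individual's presence.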
Symbol | Definition | Symbol | Definition
$\phi(i)$ | Shapley value of data point $i$ | $n$ | Data size
$D$ | Data set | $\Pi$ | Set of permutation samples
$m$ | Sample size | $S$ | A coalition
$\mathcal{S}$ | All coalitions | $\mathcal{S}_k$ | All coalitions of size $k$
$v$ | Valuation function | $\Delta_i(S)$ | Marginal gain of $i$ over $S$
$\epsilon, \delta$ | Approximation parameters | $\epsilon_p, \delta_p$ | Differential privacy parameters
3 Algorithms and analysis
We have discussed that the computation of Shapley values is expensive. Even with sampling-based approximations, large fractions of the dataset are used to answer a single query for $\phi(i)$. To see this, consider a single random permutation $\pi$. With probability at least $1/2$, data point $i$ is in the second half of $\pi$. Thus, with probability at least $1/2$, the prefix of points preceding $i$ contains at least half the dataset, and at least half the dataset is required to compute a single marginal value. Since the computation of $\hat{\phi}(i)$ involves many such marginal valuations, nearly the entire dataset is used to answer a query for a single Shapley value. In addition to the risk of exposing all data to the agent performing the computation, the large coalition sizes create challenges for differential privacy, since a query for any $\phi(i)$ may reveal information about any other data point.
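This effect is easy to simulate: a handful of uniformly random permutations already touches essentially every data point (a small illustrative simulation, not the paper's code; the sizes and sample counts are arbitrary):

```python
import random

def fraction_used(n, m, seed=0):
    """Fraction of an n-point dataset touched when estimating one Shapley
    value from m random permutations: every point preceding the queried
    point i in some permutation must be accessed."""
    rng = random.Random(seed)
    used = {0}                          # querying point i = 0
    players = list(range(n))
    for _ in range(m):
        pi = players[:]
        rng.shuffle(pi)
        used.update(pi[:pi.index(0) + 1])  # i's prefix, including i
    return len(used) / n

print(fraction_used(n=1000, m=20))  # already close to 1 after 20 permutations
```

Each other point lands before $i$ with probability $1/2$ per permutation, so after $m$ permutations it is missed with probability only $2^{-m}$.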
In this section, we argue that by using specific properties of the marginal loss in machine learning, we can address these issues with an estimation algorithm that uses only a small fraction of the data. In the following subsection, we analyze the intuitive idea that in larger datasets, the marginal contributions of individual data points are proportionally smaller.
3.1 Diminishing marginal gains with data
In this subsection, we discuss how increasing data volume reduces the marginal gain per data point (e.g. Sun et al. (2017), Fig. 4). With increasing data, the algorithm approaches the optimal model, and the loss converges to the minimum in ever smaller steps. This effect can be seen more formally in the case of empirical risk minimization, using the simple setup of binary classification with 0-1 loss Shalev-Shwartz and Ben-David (2014). Given a set $S$ of labelled data points $(x_j, y_j)$, the empirical risk of a model $h$ is defined by $L_S(h) = \frac{1}{|S|} |\{j : h(x_j) \neq y_j\}|$, that is, the fraction of points in $S$ incorrectly classified by $h$. The models are drawn from a hypothesis class $\mathcal{H}$. The optimal model $h_S$ in the class is the one that minimizes the loss over $S$. Since the loss is an average over the number of data points, the introduction of an additional data point can only change the risk by $O(1/|S|)$:

Observation 3.1. For any subset $S$ and new data point $z$, the marginal change in loss of the optimal model is bounded by $O(1/|S|)$.
In machine learning, a natural value function is defined by the (negative) empirical training loss: $v(S) = -L_S(h_S)$. Or, if an upper bound $c$ on the loss is known, then possibly $v(S) = c - L_S(h_S)$. In either case, for a data point $z$ and any set $S$, the observation above implies a bound on the marginal gain of $z$ w.r.t. $S$: $|\Delta_z(S)| = O(1/|S|)$.
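The arithmetic behind Observation 3.1 can be checked directly for any fixed classifier $h$: adding one point to a set of size $|S|$ moves the 0-1 empirical risk by at most $1/(|S|+1)$ (the bound for the optimal model then follows by comparing the two optima). A small self-contained check, with an arbitrary classifier and synthetic data:

```python
import random

def empirical_risk(h, S):
    """0-1 loss: fraction of labelled points (x, y) in S misclassified by h."""
    return sum(h(x) != y for x, y in S) / len(S)

rng = random.Random(1)
h = lambda x: x > 0.5                       # a fixed, arbitrary classifier
S = [(rng.random(), rng.random() > 0.4) for _ in range(200)]
z = (0.9, False)                            # one new (mislabelled) point

gap = abs(empirical_risk(h, S + [z]) - empirical_risk(h, S))
print(gap, "<=", 1 / (len(S) + 1))          # the marginal change is O(1/|S|)
```

If $S$ has $e$ errors out of $|S|$ points, the risk moves from $e/|S|$ to $(e+\delta)/(|S|+1)$ with $\delta \in \{0,1\}$, and the difference is at most $1/(|S|+1)$ in absolute value.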
Regularized ERM and stability. The bound on marginal differences holds more generally for stable machine learning algorithms. One of the commonly used stability notions is Uniform Stability Bousquet and Elisseeff (2002). Suppose the dataset $D$ of size $n$ contains points $z_j$ for $j = 1, \dots, n$ from the domain $\mathcal{Z}$. Let $D^{\setminus j}$ represent $D$ with data point $z_j$ removed. Suppose we write $\ell(A_D, z)$ to denote the loss on a point $z$ of the model computed by algorithm $A$ on data $D$. Given this, Uniform Stability ensures that the change in the loss on any data point is bounded by $\beta$ when any individual point is removed from the training set:

Definition 3.2 (Uniform stability Bousquet and Elisseeff (2002)). A learning algorithm $A$ has uniform stability $\beta$ with respect to the loss function $\ell$ if, for all $D \in \mathcal{Z}^n$ and all $j$, $\|\ell(A_D, \cdot) - \ell(A_{D^{\setminus j}}, \cdot)\|_\infty \leq \beta$.
The definition implies a bound on the marginal gain $\Delta_i(S)$. For example, for regularized algorithms such as L2-regularized regression in a reproducing kernel Hilbert space with kernel bound $\kappa$, regularization strength $\lambda$ and loss Lipschitz constant $\sigma$, we have $\beta \leq \frac{\sigma^2 \kappa^2}{2 \lambda n}$ Bousquet and Elisseeff (2002). When $v$ is defined to be either the averaged empirical or validation loss, this implies that the marginal difference is bounded by $O(1/n)$, and so is $\Delta_i(S)$.
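The $O(1/(\lambda n))$ behaviour can be checked numerically in a stripped-down setting. The sketch below uses one-dimensional ridge regression (our simplification, not the paper's RKHS setup) and measures how far the learned weight moves when a single training point is removed:

```python
import random

def fit_ridge_1d(sxy, sxx, n, lam):
    """Closed-form minimizer of (1/n) * sum (w*x - y)^2 + lam * w^2."""
    return (sxy / n) / (sxx / n + lam)

def max_loo_shift(n, lam=0.1, seed=0):
    """Largest change in the learned weight over all leave-one-out removals:
    a proxy for the stability gap, expected to shrink like O(1/(lam*n))."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2 * x + rng.gauss(0, 0.1)))  # y = 2x + noise
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    w = fit_ridge_1d(sxy, sxx, n, lam)
    return max(
        abs(w - fit_ridge_1d(sxy - x * y, sxx - x * x, n - 1, lam))
        for x, y in data
    )

print(max_loo_shift(100), max_loo_shift(1000))  # the shift shrinks with n
```

Growing the dataset tenfold shrinks the leave-one-out shift by roughly the same factor, matching the $O(1/n)$ stability bound.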
Uniformly stable algorithms are known to have strong generalization properties Bousquet and Elisseeff (2002); Feldman and Vondrak (2018), and for this reason are commonly used in research and practice. For example, regularized ERM methods such as linear and logistic regression with L2 regularization satisfy this property. Several other forms of regularizers and learning algorithms satisfy the property as well (see Bousquet and Elisseeff (2002); Audiffren and Kadri (2013)). For popular techniques such as stochastic gradient descent, there has been recent progress in establishing stability. Uniform stability with $O(1/n)$ marginal differences is known to hold for SGD in expectation, even in non-convex cases Hardt et al. (2016). Next, we see how this property of marginal differences can be used to design improved algorithms for the Shapley value of data.
3.2 Layered Shapley value Algorithm
Our approach to designing an efficient algorithm is to leverage the assumption that the marginal difference in the value of a coalition $S$ on addition of any single data point is bounded by $C/k$, where $k$ is the size of $S$ and $C$ is a constant independent of $k$.
With this assumption, we can increase the probability of small coalitions being evaluated, since the value changes only slowly with the addition of points to large coalitions. The algorithm is presented as Algorithm 1. It operates by stratifying the coalitions into layers by their sizes, and then estimating the expected marginal gain from $i$ in each layer. The algorithm is analogous to selecting $m_k$ random coalitions from layer $k$. Observe that $m_k$ drops rapidly as the coalition sizes increase. Where $m_k$ is smaller than $1$, the algorithm is probabilistically equivalent to sampling layer $k$ with probability $m_k$.
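A minimal sketch of the layered idea follows. The per-layer sample counts $m_k$ come from a generic Hoeffding-style schedule and the utility function is an arbitrary diminishing-returns example; both are our illustrative choices, not Algorithm 1's exact specification:

```python
import random
from math import ceil, log

def layered_shapley(players, v, i, eps=0.5, delta=0.1, C=1.0, seed=0):
    """Sketch of layered sampling: stratify coalitions by size k and sample
    layer k only m_k times, with m_k shrinking as the C/k bound on the
    marginal gain shrinks."""
    rng = random.Random(seed)
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for k in range(n):                   # layer k: coalitions of size k
        bound = C / max(k, 1)            # marginal gains in layer k are <= C/k
        m_k = ceil((2 * bound**2 / eps**2) * log(2 * n / delta))
        layer = 0.0
        for _ in range(m_k):
            S = set(rng.sample(others, k))
            layer += v(S | {i}) - v(S)
        total += layer / m_k             # per-layer average marginal gain
    return total / n                     # average over layers, as in Eq. (1)

players = list(range(12))
v = lambda S: sum(1 / j for j in range(1, len(S) + 1))  # diminishing returns
print(layered_shapley(players, v, 0))   # ≈ 0.2586, the exact symmetric value
```

For this utility the marginal gain in layer $k$ is exactly $1/(k+1) \leq C/k$ with $C = 1$, so the large layers contribute little and receive only a handful of samples.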
Theorem 3.3.
The estimate $\hat{\phi}(i)$ is an $(\epsilon, \delta)$-approximation, that is, $\Pr[|\hat{\phi}(i) - \phi(i)| \geq \epsilon] \leq \delta$, and is computed using a coalition sample complexity of $O\!\left(\frac{C^2}{\epsilon^2} \log \frac{n}{\delta}\right)$.
The proof of Theorem 3.3 is in the appendix. The proof essentially relies on Hoeffding’s bound to show that the estimate for each layer is within a small error, and uses the union bound to argue that the average error over all layers is probably small.
We have noted earlier that access to large data volumes for each query is undesirable. The following theorem shows that on each query, the algorithm only needs to access a small fraction of data:
Theorem 3.4.
The probability that any given data point $j \neq i$ is used in the computation of $\hat{\phi}(i)$ is bounded by $\tilde{O}(1/n)$.
In the next subsection, we use this result to provide differentially private Shapley values.
3.3 Differentially Private Shapley Values
Using the layered sampling approach outlined in Algorithm 1 together with the bounded marginal contributions discussed in Section 3.1, a differentially private Shapley value can be released via the Laplace mechanism Dwork et al. (2006). (This approach could also be trivially extended to satisfy $(\epsilon, \delta)$-differential privacy via the Gaussian mechanism.) For problems with marginal contributions bounded by $C/k$, the sensitivity of the Shapley value is also bounded, and this bound can be used to ensure differential privacy. In this setting, output perturbation is preferable to perturbing intermediate steps of the algorithm (e.g. using private machine learning to evaluate $v$). This is due both to the necessity of composition over the $T$ evaluations in that case (where $T$ is the total number of coalitions evaluated, each involving two machine learning models), and to the fact that private machine learning is designed precisely to mask the differences due to a single point that the marginal contribution is meant to measure. Instead, we combine layered sampling with the bounded sensitivity to release the private Shapley value (Algorithm 2).
This algorithm can be shown to be differentially private (Theorem 3.5).
Theorem 3.5.
Algorithm 2 satisfies $\epsilon$-differential privacy with noise scale $\Delta / \epsilon'$, where $\Delta$ is the sensitivity of the estimate $\hat{\phi}(i)$ and $\epsilon'$ is the amplified privacy budget implied by the sampling probability of Theorem 3.4.
The proof first demonstrates that the sensitivity of the approximated Shapley value is bounded, and then makes use of the fact that any data point has only a small probability of being used. This allows us to use results on privacy amplification by sampling Kellaris and Papadopoulos (2013); Beimel et al. (2014); Balle et al. (2018) to obtain differential privacy without excessive noise.
3.4 Properties and other observations
Computation and data access costs. Compared to algorithms that use a large number of coalition samples and almost all data points, the Layered Shapley approach touches only a small fraction of the data points. This is a systems advantage, since accessing large datasets can incur significant disk, network, and device access costs.
Computationally, the small data requirement implies that the average sampled coalition is small. Since a training algorithm needs to be run for each coalition, this gives a large advantage. For example, if the training algorithm runs in time quadratic in the number of training points, traditional approximation algorithms that repeatedly train on coalitions containing a constant fraction of the $n$ data points pay a cost quadratic in $n$ per evaluation, whereas the Layered algorithm trains mostly on far smaller coalitions.
Small sets for evaluations. The sets generated during a run of the algorithm can be saved and treated like a core set: a small sample of a large dataset that serves to approximate results for future queries. In such a setup, each contributor needs to submit only a small fraction of their data for the general service of data evaluation. This is more privacy-friendly and likely to be acceptable to both individuals and institutions.
Valuation of data subsets. In practice, a single contributor will likely submit multiple data points, and the point of interest is that the total or average value (and corresponding compensation) is accurate. If a person submits $t$ data points, then it follows from a simple probabilistic analysis that, to ensure the average value is within an error of $\epsilon$, a sample complexity smaller by a factor of $t$ suffices. Thus, multiple data contributions and queries for subsets effectively decrease the required samples and costs.
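A quick simulation illustrates why averaging helps. The per-point estimates below are simulated as Gaussian noise around a common true value (an assumption for illustration only, standing in for Monte Carlo estimation error):

```python
import random
import statistics

rng = random.Random(0)
true_val, sigma, t = 0.5, 1.0, 25

# Per-point estimates have noise ~ sigma; a contributor's reported value is
# the mean over their t points, whose noise is ~ sigma / sqrt(t).
contributor_avgs = [
    statistics.fmean(rng.gauss(true_val, sigma) for _ in range(t))
    for _ in range(400)
]
spread = statistics.stdev(contributor_avgs)
print(spread)   # ≈ sigma / sqrt(t) = 0.2
```

Since the error of the average shrinks as $1/\sqrt{t}$, hitting a fixed target error $\epsilon$ requires a factor-$t$ smaller per-point sample complexity.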
4 Experiments
We now provide empirical results demonstrating the efficacy of our algorithms on binary classification tasks with regularized logistic regression.
Experimental Setup: Experiments were performed using both publicly available binary classification datasets and synthetic data matching the dataset used by Ghorbani and Zou (2019). (The code used in these experiments is an extension of https://github.com/amiratag/DataShapley Ghorbani and Zou (2019), which is licensed under the MIT License; see https://github.com/amiratag/DataShapley/blob/master/README.md. The Adult dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Diabetes dataset under the Creative Commons Attribution 1.0 Universal (CC0 1.0) license.) The publicly available datasets used were the Adult dataset Kohavi and others (1996) and the Diabetes dataset from the UCI Machine Learning Repository Frank and Asuncion (2010). The synthetic data was generated by following the synthetic data generation approach of Ghorbani and Zou (2019), including sampling features from a 50-dimensional multivariate Gaussian distribution. All experiments use the Scikit-Learn Pedregosa et al. (2011) implementation of regularized logistic regression and the noise scale appropriate for the chosen privacy budget $\epsilon$. Private algorithm performance is reported as an average over 5 runs. In these experiments, $v$ is defined to be the negative held-out loss, and coalitions with accuracy below the random-guessing baseline were not included. See the Appendix for further experimental details.
Results: As shown in Figure 1, our private and non-private Shapley value algorithms identify valuable data points. In the top row of Figure 1, removing data points with high private or non-private Shapley values results in a faster drop in classifier accuracy in comparison to removing the same number of randomly selected data points via the random Shapley value baseline. This implies that the identified points are of higher value to the accuracy of the classifier than randomly selected points. In the bottom row, we see that removing low-value points initially results in a gain in accuracy, before accuracy drops more slowly in comparison to removing randomly selected points. For the Diabetes dataset, this holds only for the bottom 20% of data, whereas for the other datasets it holds more generally. Overall, this implies that the lowest-Shapley-value points are those with low value for the classifier, in the sense that removing them helps performance. Together, these results imply that the private Shapley values obtained contain meaningful information about the value of a given data point for a classifier, even under differential privacy.
We also report the Spearman rank correlation between the private and non-private Layered Shapley values. This correlation coefficient ranges from $-1$, implying a perfect negative correlation, through $0$, implying no correlation, to $1$, implying a perfect positive correlation. As the obtained values are significantly above $0$ for all datasets, including Adult and Diabetes, the private values largely preserve the approximate ranks (note that preserving ranks exactly would contradict privacy).
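For reference, the Spearman rank correlation can be computed in a few lines (a simple no-ties implementation; the value lists below are hypothetical and only illustrate the computation):

```python
def spearman(xs, ys):
    """Spearman rank correlation via the classic closed form
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))   # in [-1, 1]

nonprivate = [0.9, 0.5, 0.1, 0.7, 0.3]     # hypothetical non-private values
private    = [0.8, 0.6, 0.2, 0.9, 0.1]     # noisy, but similarly ordered
print(spearman(nonprivate, private))       # → 0.8
```

A correlation well above zero, as here, means the noisy values mostly preserve the ordering even though individual values are perturbed.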
5 Related Work
Data Valuation. Data valuation is a relatively new field that has not been widely addressed until recent years Jia et al. (2019c); Ghorbani and Zou (2019); Ghorbani et al. (2020); Yoon et al. (2020); González Cabañas et al. (2017); Wei et al. (2020). One main intuition is to value the contributions of different data sources after a model is learned. The valuation can further be used, for example, to make a reasonable payment to each data contributor, which has been discussed and applied in crowdsourcing and federated learning Jia et al. (2019c); Ghorbani and Zou (2019); Wei et al. (2020). There are several algorithms for evaluating data points, such as leave-one-out testing Cook (1977), influence function estimation Koh and Liang (2017); Sharchilev et al. (2018); Datta et al. (2015); Pruthi et al. (2020), and core sets Dasgupta et al. (2009). However, the purposes of these works are mostly model explanation or stability improvement. Tasks such as rewarding data contributors require additional properties such as fairness or privacy Li et al. (2020, 2021). The Shapley value, on the other hand, provides a data valuation method with axiomatic fairness properties Jia et al. (2019c).
Shapley value in machine learning. The Shapley value has found numerous applications in machine learning Rozemberczki et al. (2022). Due to the hardness of computing exact Shapley values, approximation algorithms are heavily discussed. Maleki et al. Maleki et al. (2013) provide a general bound on Shapley value estimation with Monte Carlo sampling and show the efficiency of stratified sampling under certain assumptions. In data evaluation applications, Ghorbani et al. proposed a framework for utilizing the Shapley value in a data-sharing system Ghorbani and Zou (2019). Jia et al. advanced this work with more detail and several efficient algorithms to approximate the Shapley value under different assumptions Jia et al. (2019c). The distributional Shapley value has also been discussed in Ghorbani et al. (2020); Kwon et al. (2021) to address incremental updates to Shapley values, which are difficult under Monte Carlo approximation methods. These methods calculate the Shapley value over a distribution, without revealing the true Shapley value in the output.
Several works have explored the use of Shapley values for feature selection and feature importance. Here, the Shapley values of features quantify how much individual features contribute to the model's performance on a set of data points Guyon and Elisseeff (2003); Fryer et al. (2021); Cohen et al. (2005); Patel et al. (2021); Tripathi et al. (2020); Sun et al. (2012); Guha et al. (2021); Williamson and Feng (2020). Several different approximation approaches have been proposed for the feature Shapley value. Cohen et al. Cohen et al. (2007) assume that the number of interactions between features is significantly smaller than the combinatorial number among all features and derive the Shapley value via coalition sets of only constant size. Other works use a variable importance measure (VIM) to quantify the predictive value of each feature, called the Shapley Population Variable Importance Measure (SPVIM), which can be estimated efficiently in the number of features Williamson and Feng (2020); Covert and Lee (2021). In general, the Shapley value has been widely used as a scoring mechanism in interpretable machine learning Rozemberczki et al. (2022). In comparison, our work focuses on differentially private Shapley values of data points. To the best of our knowledge, this is the first work addressing the differential privacy of data point valuation. The Shapley value of data points is a particularly challenging case, since datasets can be large and a single value computation requires many evaluations. Our algorithm operates on smaller samples of data points to obtain privacy-compatible results.
6 Conclusion
We address the privacy issue of Shapley value-based data valuation and propose the Layered Shapley value algorithm. The algorithm preserves differential privacy and utilizes diminishing marginal gains to provide efficient computation. The theoretical bound does not extend to algorithms without uniform stability, such as training neural networks. Stability results in Hardt et al. (2016) that hold for SGD in expectation suggest that suitable results may be derived in the future. In experiments, we find that both the differentially private and non-private Shapley values computed by our algorithm remain useful compared with the baseline. We imagine a system where individuals can easily obtain valuations of their data. The theoretical results in this paper provide algorithms, but we are still far from widely usable systems. A major challenge for such a system will be producing meaningful value (e.g. compensation) instead of abstract numbers, which are hard to translate into social value. In such real systems, the Shapley value may or may not be the right approach. Its axiomatic properties are often cited as the reason to use it, but to what extent these hold under Monte Carlo approximation remains to be investigated. It is also unclear whether, in the case of data contributions, Shapley values agree with the human intuition of value. A few works have suggested that people may overlook the properties of the Shapley value and have incorrect expectations of it Kumar et al. (2020); Fryer et al. (2021).
Social impact. While this work contributes to the domain of ethical machine learning research by extending data valuation techniques to include privacy-preserving valuation, its social impacts may include negative elements. Both the Shapley value and differential privacy are non-trivial concepts, and it is unclear whether a system combining the two actually helps people make better decisions, or will simply add to greater uncertainty and fear of technology. Differential privacy, for example, does not provide absolute privacy, but rather a probabilistic guarantee that depends on $\epsilon$, which may be a source of misunderstanding. Differential privacy is also known to have disparate impact Bagdasaryan et al. (2019); Ganev et al. (2021), and it is unclear whether, in this case, the algorithm will maintain the fairness of valuations.
The use of such data valuation services may itself be susceptible to attacks, fraud, and abuse. Unethical players may contribute spurious data points with the objective of increasing their own values or disrupting those of others. Valuation systems may perpetuate fraud and introduce the issue of monitoring the agent performing valuations. Leaked data valuations may make high-value data holders subject to attacks and fraud. The idea of incentivizing the contribution of (high-value) data, while useful in theory, comes with some potential for abuse. Depending on the circumstances, it can be seen as coercion to contribute data or a penalty for not contributing data. Specifically, high-value data is also likely to be privacy sensitive, and the incentives can be seen as a push toward loss of privacy.
References
 Stability of multitask kernel regression algorithms. In Asian Conference on Machine Learning, pp. 1–16. Cited by: §3.1.
 Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems 32. Cited by: §6.
 Privacy amplification by subsampling: tight analyses via couplings and divergences. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6280–6290. Cited by: Appendix D, §3.3.
 Bounds on the sample complexity for private learning and private data release. Machine learning 94 (3), pp. 401–437. Cited by: Appendix D, §3.3.
 Stability and generalization. The Journal of Machine Learning Research 2, pp. 499–526. Cited by: Appendix D, §3.1, §3.1, §3.1, Definition 3.2.
 Polynomial calculation of the shapley value based on sampling. Computers & Operations Research 36 (5), pp. 1726–1730. Cited by: §2.1.1.
 Feature selection via coalitional game theory. Neural Computation 19 (7), pp. 1939–1961. Cited by: §1, §5.
 Feature Selection Based on the Shapley Value. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 665–670. Cited by: §5.
 Detection of influential observation in linear regression. Technometrics. Cited by: §5.
 Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. In International Conference on Artificial Intelligence and Statistics, pp. 3457–3465. Cited by: §5.
 Sampling algorithms and coresets for ℓ_p regression. SIAM Journal on Computing, pp. 2060–2078. Cited by: §5.
 Influence in Classification via Cooperative Game Theory. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §5.
 On the complexity of cooperative solution concepts. Mathematics of operations research 19 (2), pp. 257–266. Cited by: §1.
 Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: Appendix D, §3.3.
 Differential privacy. In Automata, Languages and Programming, pp. 1–12. Cited by: §2.2, Definition 2.2.
 Generalization bounds for uniformly stable algorithms. Advances in Neural Information Processing Systems 31. Cited by: §3.1.
 UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science 213 (2). Cited by: §4.
 Shapley Values for Feature Selection: the Good, the Bad, and the Axioms. arXiv:2102.10936. Cited by: §2.1, §5, §6.
 Robin hood and matthew effects–differential privacy has disparate impact on synthetic data. arXiv preprint arXiv:2109.11429. Cited by: §6.
 A distributional framework for data valuation. In International Conference on Machine Learning, pp. 3535–3544. Cited by: §2.1, §5, §5.
 Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §1, §1, §2.1, §4, §5, §5, footnote 2.
 Neuron Shapley: Discovering the Responsible Neurons. In Advances in Neural Information Processing Systems, pp. 5922–5932. Cited by: §1.
 FDVT: data valuation tool for facebook users. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3799–3809. Cited by: §5.
 CGA: a new feature selection model for visual human action recognition. Neural Computing and Applications 33 (10), pp. 5267–5286. Cited by: §5.
 An Introduction to Variable and Feature Selection. Journal of machine learning research 3 (Mar), pp. 1157–1182. Cited by: §2.1, §5.
 Train faster, generalize better: stability of stochastic gradient descent. In International conference on machine learning, pp. 1225–1234. Cited by: §3.1, §6.
 Efficient task-specific data valuation for nearest neighbor algorithms. arXiv preprint arXiv:1908.08619. Cited by: §1.
 Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. Cited by: §2.1.
 An empirical and comparative analysis of data valuation with scalable algorithms. Cited by: §5, §5.
 Practical differential privacy via grouping and smoothing. Proceedings of the VLDB Endowment 6 (5), pp. 301–312. Cited by: Appendix D, §3.3.
 Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. Cited by: §5.
 Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96, pp. 202–207. Cited by: §4.
 Problems with Shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning, pp. 5491–5500. Cited by: §6.
 Efficient computation and analysis of distributional shapley values. In International Conference on Artificial Intelligence and Statistics, pp. 793–801. Cited by: §1, §5.
 A survey on federated learning systems: vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering. Cited by: §5.
 Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, pp. 50–60. Cited by: §5.
 GTG-Shapley: Efficient and Accurate Participant Contribution Evaluation in Federated Learning. arXiv:2109.02053. Cited by: §1.
 A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, pp. 4768–4777. Cited by: §1.
 Bounding the Estimation Error of Sampling-based Shapley Value Approximation. arXiv:1306.4265. Cited by: §1, §2.1.1, Figure 1, §5.
 Game-Theoretic Vocabulary Selection via the Shapley Value and Banzhaf Index. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 2789–2798. Cited by: §5.
 Scikit-learn: machine learning in Python. The Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.
 Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems 33, pp. 19920–19930. Cited by: §5.
 The Shapley Value of Classifiers in Ensemble Games. In Proceedings of the 30th International Conference on Information and Knowledge Management, pp. 1558–1567. Cited by: §1.
 The Shapley value in machine learning. arXiv preprint arXiv:2202.05594. Cited by: §1, §2.1, §5.
 Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §3.1.
 A Value for N-Person Games. Contributions to the Theory of Games, pp. 307–317. Cited by: §1, §2.1.
 Finding influential training samples for gradient boosted decision trees. In International Conference on Machine Learning, pp. 4577–4585. Cited by: §5.
 Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §2.2.
 Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852. Cited by: §1, §3.1.
 Feature Evaluation and Selection with Cooperative Game Theory. Pattern Recognition 45 (8), pp. 2992–3002. Cited by: §5.
 Interpretable Feature Subset Selection: A Shapley Value Based Approach. In IEEE International Conference on Big Data, pp. 5463–5472. Cited by: §1, §2.1, §5.
 Efficient and fair data valuation for horizontal federated learning. In Federated Learning, pp. 139–152. Cited by: §5.
 Efficient Nonparametric Statistical Inference on Population Feature Importance Using Shapley Values. In International Conference on Machine Learning, pp. 10282–10291. Cited by: §5.
 Data valuation using reinforcement learning. In International Conference on Machine Learning, pp. 10842–10851. Cited by: §5.
Appendix A Proof of Observation 3.1
First, observe that for any , . This is because, if is correctly classified by , then the difference of the two losses is , where is the number of incorrect classifications. Since , the difference is at most . On the other hand, if is classified incorrectly, then the difference is .
In the case where the optimal hypothesis does not change with the introduction of , that is, , the observation above applies directly and the difference is at most .
Now, in the event that , we consider two cases. Case 1 is when . We know that and that . Therefore, . Similarly, in Case 2, when , we know that , which implies that .
Appendix B Proof of Theorem 3.2
In a sample from layer , the probability that a coalition containing is used is: .
Thus, given the expected number of samples at layer (see the discussion above), the probability that appears in a sampled coalition from stratum is . Hence, over all the strata, the probability
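As background for the sampling argument above, the plain Monte Carlo estimator of Maleki et al. (2013) averages marginal contributions over uniformly sampled permutations of the players. The following is a minimal sketch, not the paper's stratified algorithm; the `utility` callable is a hypothetical stand-in for training and evaluating a model on a coalition.

```python
import random

def monte_carlo_shapley(players, utility, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over uniformly sampled permutations of the players."""
    rng = random.Random(seed)
    estimates = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)  # one uniformly random permutation
        coalition = []
        prev_value = utility(coalition)
        for p in order:
            coalition.append(p)
            value = utility(coalition)
            estimates[p] += value - prev_value  # marginal contribution of p
            prev_value = value
    return {p: total / num_permutations for p, total in estimates.items()}

# Toy additive game: the utility of a coalition is the sum of fixed weights,
# so the exact Shapley value of each player equals its own weight.
weights = {0: 1.0, 1: 2.0, 2: 3.0}
values = monte_carlo_shapley(list(weights),
                             lambda s: sum(weights[p] for p in s),
                             num_permutations=200)
```

For an additive game every marginal contribution equals the player's weight, so the estimate is exact here; in general the error decays at the Monte Carlo rate in the number of sampled permutations.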
Appendix C Proof of Theorem 3.3
We first show the correctness of the approximation.
Lemma C.1.
The estimate of the Shapley value is an approximation of the true Shapley value . That is, .
Proof.
Consider the estimate at any layer . By Algorithm 1, .
Since , using the Chernoff–Hoeffding bound, we have . Substituting the expression for , we get that .
By union bound, the probability that
Observe that the event requires that . Thus:
Now, we can rewrite as . Thus:
∎
Now observe that since each of the items in layer is sampled with probability , the expected number of samples in layer is . The sample complexity follows by summing the sample complexities of the layers; that is, the sample complexity is . Combining this with Lemma C.1 gives the theorem.
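For reference, the concentration step in Lemma C.1 invokes the Chernoff–Hoeffding bound, which in its standard form reads:

```latex
% Hoeffding's inequality: for independent random variables X_1, ..., X_n
% with X_i \in [a_i, b_i] and S_n = \sum_{i=1}^{n} X_i,
\Pr\!\left[\, \bigl| S_n - \mathbb{E}[S_n] \bigr| \ge t \,\right]
  \;\le\; 2\exp\!\left( \frac{-2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
```

When the summands share a common range of width one, as for bounded marginal contributions, the denominator reduces to n.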
Appendix D Proof of Theorem 3.5
Denote the set of all sampled coalitions of size in Algorithm 1 by and let . Assume that is the loss function and that, in a given marginal contribution evaluation, the learning algorithm uses a sampled coalition of size as its training set. Denote the uniform stability of the learning algorithm using a training set of size by . Suppose is a neighbouring dataset of , differing in at most a single point that is not the point being evaluated.
When a sampled coalition has data points, let us denote the set of points in it by . Its neighbour is written as , respectively. and can differ in at most a single data point, and so they are also neighbouring datasets.
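The argument relies on uniform stability; for reference, the standard definition from Bousquet and Elisseeff [5] is as follows (the symbol β_m is our notation here for the stability parameter at training-set size m):

```latex
% Uniform stability (Bousquet and Elisseeff, 2002): a learning algorithm A
% is \beta_m-uniformly stable if, for all training sets S, S' of size m
% differing in a single point, and for every example z,
\sup_{z}\; \bigl|\, \ell(A_{S}, z) - \ell(A_{S'}, z) \,\bigr| \;\le\; \beta_m .
```

Intuitively, replacing one training point changes the loss on any test point by at most β_m, which is what bounds the sensitivity of each marginal contribution below.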
Due to the Laplace Mechanism [14], noise of scale will suffice to guarantee differential privacy, where is the sensitivity of the estimated Shapley value. By expressing it in terms of the marginal contributions, we can bound the sensitivity as follows:
The last three inequalities follow from the fact that the sensitivity is for a coalition of size , and from the uniform stability bound for regularized algorithms given by [5].
Finally, due to Theorem 3.4 and amplification by sampling [30, 4, 3], Laplace noise with scale suffices with .
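Concretely, the Laplace Mechanism amounts to adding noise with scale equal to sensitivity/ε to the released value. The following is a minimal Python sketch of the mechanism in general; the function names are illustrative and not from the paper's implementation.

```python
import math
import random

def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for epsilon-DP (Laplace mechanism)."""
    return sensitivity / epsilon

def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(value, sensitivity, epsilon, rng=None):
    """Release value plus Laplace noise calibrated to sensitivity and epsilon."""
    rng = rng or random.Random()
    return value + sample_laplace(laplace_scale(sensitivity, epsilon), rng)

# e.g., releasing a Shapley estimate with (hypothetical) sensitivity 0.1 at epsilon = 0.5
noisy = privatize(0.42, sensitivity=0.1, epsilon=0.5, rng=random.Random(7))
```

Smaller ε (stronger privacy) or larger sensitivity yields a larger noise scale; amplification by subsampling lets the same guarantee be met with a smaller effective ε, and hence less noise, as in the proof above.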
Note that this sensitivity and noise scale are asymptotically better than the bound stated in the theorem, since in our algorithm . The main body of the paper will be updated in the camera-ready version.
Appendix E Further Experimental Details
Parameters: Experiments use the following settings: , , , and . The data is normalized to the range in order to bound .
Compute: The compute requirements of these experiments were low; all experiments were run on laptops with 2.9 GHz quad-core Intel Core i7 processors.