Distributional Shapley: A Distributional Framework for Data Valuation
Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for statistical aspects of the data and does not give a way to reason about points outside the data set. To address these limitations, we propose a novel framework – distributional Shapley – where the value of a point is defined in the context of an underlying data distribution. We prove that distributional Shapley has several desirable statistical properties; for example, the values are stable under perturbations to the data points themselves and to the underlying data distribution. We leverage these properties to develop a new algorithm for estimating values from data, which comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms for computing the (non-distributional) data Shapley values. We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
As data becomes an essential driver of innovation and service, how to quantify the value of data is an increasingly important topic of inquiry with policy, economic, and machine learning (ML) implications. In the policy arena, recent proposals, such as the Dashboard Act in the U.S. Senate, stipulate that large companies quantify the value of data they collect. In the global economy, the business model of many companies involves buying and selling data. For ML engineering, it is often beneficial to know which type of training data is most valuable and, hence, most deserving of resources towards collection and annotation. As such, a principled framework for data valuation would be tremendously useful in all of these domains.
Recent works initiated a formal study of data valuation in ML [GZ19, JDW19b]. In a typical setting, a data set is used to train an ML model, which achieves a certain performance, say classification accuracy. The data valuation problem is to assign credit amongst the training set, so that each point gets an “equitable” share for its contribution towards achieving the accuracy. Most works have focused on leveraging Shapley value as the metric to quantify the contribution of individual data points. The focus on Shapley value is in large part due to the fact that Shapley uniquely satisfies basic properties for equitable credit allocation [Sha53]. Empirical experiments also show that data Shapley is very effective – more so than leave-one-out scores – at identifying points whose addition or removal substantially impacts learning [GAZ17, GZ19].
At a high level, prior works on data Shapley require three ingredients: (1) a fixed training data set of $m$ points; (2) a learning algorithm; and (3) a performance metric that measures the overall value of a trained model. The goal of this work is to significantly reduce the dependency on the first ingredient. While convenient, formulating the value based on a fixed data set disregards crucial statistical considerations and, thus, poses significant practical limitations.
In standard settings, we imagine that data is sampled from a distribution $\mathcal{D}$; measuring the Shapley value with respect to a fixed data set ignores this underlying distribution. It also means that the value of a data point computed within one data set may not make sense when the point is transferred to a new data set. If we actually want to buy and sell data, then it is important that the value of a given data point represents some intrinsic quality of the datum within the distribution. For example, a data seller might determine that a point has high value based on their own data set and sell it to a buyer at a high price. Even if the buyer’s data set is drawn from a distribution similar to the seller’s, the existing data Shapley framework provides no guarantee of consistency between the value of the point computed within the seller’s data set and within the buyer’s. This inconsistency may be especially pronounced when the buyer has significantly less data than the seller.
Extending prior works on data Shapley, we formulate and develop a notion of distributional Shapley value in Section 2. We define the distributional variant in terms of the original data Shapley: the distributional Shapley value is taken to be the expected data Shapley value, where the data set is drawn i.i.d. from the underlying data distribution. Reformulating this notion of value as a statistical quantity allows us to prove that the notion is stable with respect to perturbations to the inputs as well as the underlying data distribution. Further, we show a mathematical identity that gives an equivalent definition of distributional Shapley as an expected marginal performance increase by adding the point, suggesting an unbiased estimator.
Algorithmic. In Section 3, we develop this estimator into a novel sampling-based algorithm, $\mathcal{D}$-Shapley. In contrast to prior estimation heuristics, $\mathcal{D}$-Shapley comes with strong formal approximation guarantees. Leveraging the stability properties of distributional Shapley value and the simple nature of our algorithm, we develop theoretically principled optimizations to $\mathcal{D}$-Shapley. In our experiments across diverse tasks, the optimizations lead to order-of-magnitude reductions in computational costs while maintaining the quality of estimations.
Empirical. Finally, in Section 4, we present a data pricing case study that demonstrates the consistency of values produced by $\mathcal{D}$-Shapley. In particular, we show that a data broker can list distributional Shapley values as “prices,” which a collection of buyers all agree are fair (i.e. the data gives each buyer as much value as the seller claims). In all, our results demonstrate that the distributional Shapley framework represents a significant step towards the practical viability of Shapley-based approaches to data valuation.
Shapley value, introduced in [Sha53], has been studied extensively in the literature on cooperative games and economics [SR88], and has traditionally been used in the valuation of private information and data markets [KPR01, ADS19].
Our work is most directly related to recent works that apply Shapley value to the data valuation problem. [GZ19] developed the notion of “Data Shapley” and provided algorithms to efficiently estimate values. Specifically, leveraging the permutation-based characterization of Shapley value, they developed a “truncated Monte Carlo” sampling scheme (referred to as TMC-Shapley), demonstrating empirical effectiveness across various ML tasks. [JDW19b] introduced several additional approximation methods for efficient computation of Shapley values for training data; subsequently, [JDW19a] provided an algorithm for exact computation of Shapley values for the specific case of nearest neighbor classifiers.
Beyond data valuation, the Shapley framework has been used in a variety of ML applications, e.g. as a measure of feature importance [CDR07, K10, DSZ16, LL17, CSWJ18]. The idea of a distributional Shapley value bears resemblance to the Aumann-Shapley value [AS74], a measure-theoretic variant of the Shapley value that quantifies the value of individuals within a continuous “infinite game.” Our distributional Shapley value focuses on the tangible setting of finite data sets drawn from a (possibly continuous) distribution.
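Let $\mathcal{D}$ denote a data distribution supported on a universe $\mathcal{Z}$. For supervised learning problems, we often think of $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output, which can be discrete or continuous. For $m \in \mathbb{N}$, let $S \sim \mathcal{D}^m$ denote a collection of $m$ data points sampled i.i.d. from $\mathcal{D}$. Throughout, we use the shorthand $[m] = \{1, \dots, m\}$ and let $k \sim [m]$ denote a uniform random sample from $[m]$.
We denote by $U : \mathcal{Z}^* \to [0, 1]$ a potential function or performance metric, where for any $S \subseteq \mathcal{Z}$, $U(S)$ represents abstractly the value of the subset. (Footnote 1: We use $\mathcal{Z}^*$ to indicate any finite Cartesian product of $\mathcal{Z}$ with itself; thus, $U$ is well-defined on any natural number of inputs from $\mathcal{Z}$.) While our analysis applies broadly, in our context, we think of $U$ as capturing both the learning algorithm and the evaluation metric. For instance, in the context of training a logistic regression model, we might think of $U(S)$ as returning the population accuracy of the empirical risk minimizer when $S$ is the training set.
Our starting point is the data Shapley value, proposed in [GZ19, JDW19b] as a way to valuate training data equitably.
[Data Shapley Value] Given a potential function $U$ and data set $B \subseteq \mathcal{Z}$ where $|B| = m$, the data Shapley value of a point $z \in B$ is defined as
$$\phi(z; U, B) \triangleq \frac{1}{m} \sum_{k=1}^{m} \binom{m-1}{k-1}^{-1} \sum_{\substack{S \subseteq B \setminus \{z\} \\ |S| = k-1}} \bigl( U(S \cup \{z\}) - U(S) \bigr).$$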
In words, the data Shapley value of a point $z$ is a weighted empirical average, over subsets $S \subseteq B \setminus \{z\}$, of the marginal potential contribution of $z$ to each $S$; the weighting is such that each possible cardinality $|S| \in \{0, \dots, m-1\}$ is weighted equally. The data Shapley value satisfies a number of desirable properties; indeed, it is the unique valuation function that satisfies the Shapley axioms. (Footnote 2: For completeness, the axioms – symmetry, null player, additivity, and efficiency – are reviewed in Appendix A.) Note that as the data set size grows, the absolute magnitude of individual data points’ values typically scales inversely.
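To make the definition concrete, the following is a minimal sketch that computes exact data Shapley values by enumerating every subset of a tiny data set. The names `data_shapley` and `toy_potential` are illustrative assumptions, not part of the paper, and the toy potential stands in for "train a model and evaluate it". Exhaustive enumeration is exponential in the data set size, so this is only useful for checking intuition.

```python
from itertools import combinations
from math import comb


def data_shapley(index, dataset, potential):
    """Exact data Shapley value of dataset[index] by enumerating all subsets."""
    m = len(dataset)
    others = [i for i in range(m) if i != index]
    total = 0.0
    for size in range(m):  # size = |S| = k - 1, for k = 1, ..., m
        # Each cardinality gets total weight 1/m, split evenly over its subsets.
        weight = 1.0 / (m * comb(m - 1, size))
        for subset in combinations(others, size):
            s = [dataset[i] for i in subset]
            total += weight * (potential(s + [dataset[index]]) - potential(s))
    return total


def toy_potential(points):
    # Hypothetical potential in [0, 1]: number of distinct points seen, scaled.
    return len(set(points)) / 3.0


data = ["a", "b", "b"]
values = [data_shapley(i, data, toy_potential) for i in range(len(data))]
```

In this toy example the duplicated point "b" receives a smaller value than the unique point "a", and the values sum to the potential of the full data set minus that of the empty set, which is the efficiency axiom.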
While data Shapley value is a natural solution concept for data valuation, its formulation leads to several limitations. In particular, the values may be very sensitive to the exact choice of $B$; given another data set $B'$ where $z \in B \cap B'$, the value $\phi(z; U, B)$ might be quite different from $\phi(z; U, B')$. At the extreme, if a new point $z'$ is added to $B$, then in principle, we would have to rerun the procedure to compute the data Shapley values for all points in $B \cup \{z'\}$.
In settings where our data are drawn from an underlying distribution $\mathcal{D}$, a natural extension to the data Shapley approach would parameterize the valuation function by $\mathcal{D}$, rather than by the specific draw of the data set. Such a distributional Shapley value should be more stable, because it removes the explicit dependence on the draw of the training data set.
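[Distributional Shapley Value] Given a potential function $U$, a distribution $\mathcal{D}$ supported on $\mathcal{Z}$, and some $m \in \mathbb{N}$, the distributional Shapley value of a point $z \in \mathcal{Z}$ is the expected data Shapley value over data sets of size $m$ containing $z$:
$$\nu(z; U, \mathcal{D}, m) \triangleq \mathbb{E}_{B \sim \mathcal{D}^{m-1}} \bigl[ \phi(z; U, B \cup \{z\}) \bigr].$$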
In other words, we can think of the data Shapley value $\phi(z; U, B)$ as a random variable that depends on the specific draw of data $B$ from $\mathcal{D}$. Taking the distributional Shapley value $\nu(z; U, \mathcal{D}, m)$ to be the expectation of this random variable eliminates instability caused by the variance of $\phi(z; U, B)$. While distributional Shapley is simple to state based on the original Shapley value, to the best of our knowledge, the concept is novel to this work.
We note that, while more stable, the distributional Shapley value inherits many of the desirable properties of Shapley, including the Shapley axioms and an expected efficiency property; we cover these in Appendix B. Importantly, distributional Shapley also has a clean characterization as the expected gain in potential by adding $z$ to a random data set (of random size). Fixing $U$ and $\mathcal{D}$, for all $z \in \mathcal{Z}$ and $m \in \mathbb{N}$,
$$\nu(z; U, \mathcal{D}, m) = \mathbb{E}_{k \sim [m]} \, \mathbb{E}_{S \sim \mathcal{D}^{k-1}} \bigl[ U(S \cup \{z\}) - U(S) \bigr].$$
That is, the distributional Shapley value of a point is its expected marginal contribution in $U$ to a set of i.i.d. samples from $\mathcal{D}$ of uniform random cardinality.
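The identity holds as a consequence of the definition of data Shapley value and linearity of expectation.
$$\begin{aligned}
\nu(z; U, \mathcal{D}, m) &= \mathbb{E}_{B \sim \mathcal{D}^{m-1}} \bigl[ \phi(z; U, B \cup \{z\}) \bigr] \\
&= \mathbb{E}_{B \sim \mathcal{D}^{m-1}} \Bigl[ \frac{1}{m} \sum_{k=1}^{m} \binom{m-1}{k-1}^{-1} \sum_{\substack{S \subseteq B \\ |S| = k-1}} \bigl( U(S \cup \{z\}) - U(S) \bigr) \Bigr] \\
&= \frac{1}{m} \sum_{k=1}^{m} \mathbb{E}_{S \sim \mathcal{D}^{k-1}} \bigl[ U(S \cup \{z\}) - U(S) \bigr] \qquad (2.1) \\
&= \mathbb{E}_{k \sim [m]} \, \mathbb{E}_{S \sim \mathcal{D}^{k-1}} \bigl[ U(S \cup \{z\}) - U(S) \bigr]
\end{aligned}$$
where (2.1) follows by the fact that $B \sim \mathcal{D}^{m-1}$ consists of i.i.d. samples, so each $S \subseteq B$ with $|S| = k-1$ is identically distributed according to $\mathcal{D}^{k-1}$. ∎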
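As a sanity check on this identity, the sketch below evaluates both sides exactly on a toy problem using rational arithmetic. Everything here is an illustrative assumption: the three-point universe, the uniform distribution over it, and the `potential` function are chosen only so that all expectations can be enumerated.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

UNIVERSE = (0, 1, 2)  # hypothetical finite universe; D is uniform over it


def potential(points):
    # Hypothetical potential in [0, 1]: the largest point seen, scaled by 2.
    return Fraction(max(points), 2) if points else Fraction(0)


def data_shapley(index, dataset):
    """Exact data Shapley value of dataset[index] within dataset."""
    m = len(dataset)
    others = [i for i in range(m) if i != index]
    total = Fraction(0)
    for size in range(m):
        weight = Fraction(1, m * comb(m - 1, size))
        for subset in combinations(others, size):
            s = [dataset[i] for i in subset]
            total += weight * (potential(s + [dataset[index]]) - potential(s))
    return total


def value_by_definition(z, m):
    # Expected data Shapley value of z over all draws B ~ D^(m-1), with z added.
    draws = list(product(UNIVERSE, repeat=m - 1))
    return sum(data_shapley(m - 1, list(b) + [z]) for b in draws) / len(draws)


def value_by_identity(z, m):
    # Average over k in [m] of the expected marginal gain on S ~ D^(k-1).
    total = Fraction(0)
    for k in range(1, m + 1):
        draws = list(product(UNIVERSE, repeat=k - 1))
        gain = sum(potential(list(s) + [z]) - potential(list(s)) for s in draws)
        total += gain / len(draws)
    return total / m
```

For every point in the toy universe, `value_by_definition(z, 3)` and `value_by_identity(z, 3)` agree exactly, while the second form only ever evaluates the potential on a random set with and without the point.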
Leveraging this characterization, for well-structured problems, it is possible to give analytic expressions for the distributional Shapley values. For instance, consider estimating the mean $\mu$ of a distribution $\mathcal{D}$ supported on $\mathbb{R}^d$. For a finite subset $S \subseteq \mathbb{R}^d$, we take a potential based on the empirical estimator of the mean computed from $S$.
Suppose $\mathcal{D}$ has bounded second moments. Then for $z \in \mathbb{R}^d$ and $m \in \mathbb{N}$, the distributional Shapley value of $z$ for mean estimation over $\mathcal{D}$ has a closed-form expression involving an explicit constant.
Intuitively, this proposition (proved in Appendix B) highlights some desirable properties of distributional Shapley: the expected value for a random $z \sim \mathcal{D}$ is a uniform share of the potential for a randomly drawn data set; further, a point has above-average value when it is closer to $\mu$ than expected. In general, analytically deriving the distributional Shapley value may not be possible. In Section 3, we show how the characterization of Theorem 2.1 leads to an efficient algorithm for estimating values.
Before presenting our algorithm, we discuss stability properties of distributional Shapley, which are interesting in their own right, but also have algorithmic implications. We show that when the potential function $U$ satisfies a natural stability property, the corresponding distributional Shapley value inherits stability under perturbations to the data points and the underlying data distribution. First, we recall a standard notion of deletion stability, often studied in the context of generalization of learning algorithms [BE02].
[Deletion Stability] For potential $U$ and non-increasing $\beta : \mathbb{N} \to [0, 1]$, $U$ is $\beta(k)$-deletion stable if for all $k \in \mathbb{N}$ and $S \in \mathcal{Z}^{k-1}$, for all $z \in \mathcal{Z}$,
$$\bigl| U(S \cup \{z\}) - U(S) \bigr| \le \beta(k).$$
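We can similarly discuss the idea of replacement stability, where we bound $|U(S \cup \{z\}) - U(S \cup \{z'\})|$; note that by the triangle inequality, $\beta(k)$-deletion stability of $U$ implies $2\beta(k)$-replacement stability. To analyze the properties of distributional Shapley, a natural strengthening of replacement stability will be useful, which we call Lipschitz stability. Lipschitz stability is parameterized by a metric $d$ and requires the degree of robustness under replacement of $z$ with $z'$ to scale according to the distance $d(z, z')$.
[Lipschitz Stability] Let $(\mathcal{Z}, d)$ be a metric space. For potential $U$ and non-increasing $\beta : \mathbb{N} \to [0, 1]$, $U$ is $\beta(k)$-Lipschitz stable with respect to $d$ if for all $k \in \mathbb{N}$, $S \in \mathcal{Z}^{k-1}$, and all $z, z' \in \mathcal{Z}$,
$$\bigl| U(S \cup \{z\}) - U(S \cup \{z'\}) \bigr| \le \beta(k) \cdot d(z, z').$$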
By taking $d$ to be the trivial metric, where $d(z, z') = 1$ if $z \neq z'$, we see that Lipschitz stability generalizes the idea of replacement stability; still, there are natural learning algorithms that satisfy Lipschitz stability for nontrivial metrics. As one example, we show that regularized empirical risk minimization over a Reproducing Kernel Hilbert Space (RKHS), a prototypical example of a replacement stable learning algorithm, also satisfies this stronger notion of Lipschitz stability. We include a formal statement and proof in Appendix C.
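As discussed, a key limitation with the data Shapley approach for a fixed data set $B$ is that we can only ascribe values to $z \in B$. Intuitively, however, we would hope that if two points $z$ and $z'$ are similar according to some appropriate metric, then they would receive similar Shapley values. We confirm this intuition for distributional Shapley values when the potential function $U$ satisfies Lipschitz stability.
Fix a metric space $(\mathcal{Z}, d)$ and a distribution $\mathcal{D}$ over $\mathcal{Z}$; let $U$ be $\beta(k)$-Lipschitz stable with respect to $d$. Then for all $m \in \mathbb{N}$, for all $z, z' \in \mathcal{Z}$,
$$\bigl| \nu(z; U, \mathcal{D}, m) - \nu(z'; U, \mathcal{D}, m) \bigr| \le \mathbb{E}_{k \sim [m]} \bigl[ \beta(k) \bigr] \cdot d(z, z').$$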
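For any data set size $m$, we expand $\nu(z; U, \mathcal{D}, m)$ to express it in terms of $\nu(z'; U, \mathcal{D}, m)$.
$$\begin{aligned}
\nu(z; U, \mathcal{D}, m) &= \mathbb{E}_{k \sim [m]} \, \mathbb{E}_{S \sim \mathcal{D}^{k-1}} \bigl[ U(S \cup \{z\}) - U(S) \bigr] \\
&= \nu(z'; U, \mathcal{D}, m) + \mathbb{E}_{k \sim [m]} \, \mathbb{E}_{S \sim \mathcal{D}^{k-1}} \bigl[ U(S \cup \{z\}) - U(S \cup \{z'\}) \bigr] \\
&\le \nu(z'; U, \mathcal{D}, m) + \mathbb{E}_{k \sim [m]} \bigl[ \beta(k) \bigr] \cdot d(z, z') \qquad (2.2)
\end{aligned}$$
where (2.2) follows by the assumption that $U$ is $\beta(k)$-Lipschitz stable and linearity of expectation. ∎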
Theorem 2.2 suggests that in many settings of interest, the distributional Shapley value will be Lipschitz in $z$. This Lipschitz property also suggests that, given the values of a (sufficiently diverse) set of points, we may be able to infer the values of unseen points through interpolation. Concretely, in Section 3.2, we leverage this observation to give an order of magnitude speedup over our baseline estimation algorithm.
The distributional Shapley value is naturally parameterized by the underlying data distribution $\mathcal{D}$. For two distributions $\mathcal{D}_s$ and $\mathcal{D}_t$, given the value $\nu(z; U, \mathcal{D}_s, m)$, what can we say about the value $\nu(z; U, \mathcal{D}_t, m)$? Intuitively, if $\mathcal{D}_s$ and $\mathcal{D}_t$ are similar under an appropriate metric, we would expect that the values should not change too much. Indeed, we can formally quantify how the distributional Shapley value is stable under distributional shift, measured in the Wasserstein distance.
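For two distributions $\mathcal{D}_s, \mathcal{D}_t$ over $\mathcal{Z}$, let $\Pi(\mathcal{D}_s, \mathcal{D}_t)$ be the collection of joint distributions over $\mathcal{Z} \times \mathcal{Z}$ whose marginals are $\mathcal{D}_s$ and $\mathcal{D}_t$. (Footnote 3: That is, for all $\pi \in \Pi(\mathcal{D}_s, \mathcal{D}_t)$, if $(s, t) \sim \pi$, then $s \sim \mathcal{D}_s$ and $t \sim \mathcal{D}_t$.) Fixing a metric $d$ over $\mathcal{Z}$, the Wasserstein distance is the infimum over all such couplings of the expected distance between $(s, t) \sim \pi$.
$$W_1(\mathcal{D}_s, \mathcal{D}_t) \triangleq \inf_{\pi \in \Pi(\mathcal{D}_s, \mathcal{D}_t)} \mathbb{E}_{(s, t) \sim \pi} \bigl[ d(s, t) \bigr] \tag{1}$$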
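For intuition about this distance, a special case can be computed in a few lines. The sketch below assumes one-dimensional data with the absolute-difference metric and two empirical distributions with the same number of samples; in that case the optimal coupling pairs the sorted samples in order. The function name `wasserstein_1d` is illustrative, and the general case requires solving an optimal transport problem, which this sketch does not attempt.

```python
def wasserstein_1d(xs, ys):
    """W_1 between two equal-size empirical distributions on the real line.

    Assumes d(s, t) = |s - t|. For sorted samples, matching the i-th smallest
    of xs with the i-th smallest of ys is an optimal coupling.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("expected two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


shifted = wasserstein_1d([0.0, 1.0], [1.0, 2.0])
same = wasserstein_1d([3.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

Shifting every sample by one unit gives a distance of one, and reordering the same samples gives zero, as expected from a distance between distributions rather than between lists.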
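We formalize the idea that distributional Shapley values are stable under small perturbations to the underlying data distribution as follows.
Fix a metric space $(\mathcal{Z}, d)$ and let $U$ be $\beta(k)$-Lipschitz stable with respect to $d$. Suppose $\mathcal{D}_s$ and $\mathcal{D}_t$ are two distributions over $\mathcal{Z}$. Then, for all $m \in \mathbb{N}$ and all $z \in \mathcal{Z}$,
$$\bigl| \nu(z; U, \mathcal{D}_s, m) - \nu(z; U, \mathcal{D}_t, m) \bigr| \le \frac{2}{m} \sum_{k=1}^{m-1} k \, \beta(k) \cdot W_1(\mathcal{D}_s, \mathcal{D}_t).$$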
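For notational convenience, for any $z \in \mathcal{Z}$ and subset $S \subseteq \mathcal{Z}$, we denote $\Delta_z U(S) = U(S \cup \{z\}) - U(S)$. Thus, fixing $m$, we can write $\nu(z; U, \mathcal{D}, m)$ as $\frac{1}{m} \sum_{k=0}^{m-1} \mathbb{E}_{S \sim \mathcal{D}^k} [\Delta_z U(S)]$. We analyze $\mathbb{E}_{S \sim \mathcal{D}^k} [\Delta_z U(S)]$ for each fixed $k$ separately. (Footnote 4: Note that for a fixed potential $U$, $k = 0$ is uninteresting because both sides of the inequality are $0$; in particular, $S$ is always $\emptyset$, so the LHS is given by the difference $\Delta_z U(\emptyset) - \Delta_z U(\emptyset) = 0$.)
Let $\pi \in \Pi(\mathcal{D}_s, \mathcal{D}_t)$ be some coupling of $\mathcal{D}_s$ and $\mathcal{D}_t$. Then, we can expand the expectation as follows.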
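$$\begin{aligned}
\mathbb{E}_{S \sim \mathcal{D}_s^k} \bigl[ \Delta_z U(S) \bigr] &= \mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(S) \bigr] && (2) \\
&= \mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(S) - \Delta_z U(T) \bigr] + \mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(T) \bigr] && (3) \\
&= \mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(S) - \Delta_z U(T) \bigr] + \mathbb{E}_{T \sim \mathcal{D}_t^k} \bigl[ \Delta_z U(T) \bigr] && (4)
\end{aligned}$$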
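where (2) and (4) follow by the assumption that the marginals of $\pi$ are $\mathcal{D}_s$ and $\mathcal{D}_t$; and (3) follows by linearity of expectation.
To bound the first term of (4), we expand the difference between $\Delta_z U(S)$ and $\Delta_z U(T)$ into a telescoping sum of pairs of terms, where we bound each pair to depend on a single draw $(s_i, t_i) \sim \pi$. For $S = \{s_1, \dots, s_k\}$ and $T = \{t_1, \dots, t_k\}$, denote by $R_i$ the hybrid set $\{t_1, \dots, t_i, s_{i+1}, \dots, s_k\}$; note that $R_0 = S$ and $R_k = T$. Then, we can rewrite $\Delta_z U(S) - \Delta_z U(T)$ as $\sum_{i=1}^{k} \bigl( \Delta_z U(R_{i-1}) - \Delta_z U(R_i) \bigr)$.
Now suppose $U$ is $\beta(k)$-Lipschitz stable with respect to $d$; note that this implies $\Delta_z U$ is $2\beta(k)$-Lipschitz stable (because $\beta$ is non-increasing). Then, we obtain the following bound.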
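$$\begin{aligned}
\mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(S) - \Delta_z U(T) \bigr] &= \sum_{i=1}^{k} \mathbb{E}_{(S, T) \sim \pi^k} \bigl[ \Delta_z U(R_{i-1}) - \Delta_z U(R_i) \bigr] && (5) \\
&\le \sum_{i=1}^{k} 2\beta(k) \cdot \mathbb{E}_{(s_i, t_i) \sim \pi} \bigl[ d(s_i, t_i) \bigr] && (6) \\
&= 2k \, \beta(k) \cdot \mathbb{E}_{(s, t) \sim \pi} \bigl[ d(s, t) \bigr] && (7)
\end{aligned}$$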
where (5) notes that $R_{i-1}$ and $R_i$ differ on only the $i$th data point; (6) follows from the assumption that $\Delta_z U$ is $2\beta(k)$-Lipschitz stable and linearity of expectation; and finally (7) follows by the fact that each draw from $\pi$ is i.i.d.
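Finally, we note that the argument above worked for an arbitrary coupling $\pi$ in $\Pi(\mathcal{D}_s, \mathcal{D}_t)$; thus, we can express the difference in values in terms of the infimum over $\Pi(\mathcal{D}_s, \mathcal{D}_t)$.
$$\bigl| \nu(z; U, \mathcal{D}_s, m) - \nu(z; U, \mathcal{D}_t, m) \bigr| \le \frac{1}{m} \sum_{k=1}^{m-1} 2k \, \beta(k) \cdot \inf_{\pi \in \Pi(\mathcal{D}_s, \mathcal{D}_t)} \mathbb{E}_{(s, t) \sim \pi} \bigl[ d(s, t) \bigr] = \frac{2}{m} \sum_{k=1}^{m-1} k \, \beta(k) \cdot W_1(\mathcal{D}_s, \mathcal{D}_t)$$
where the first summation is taken over $k \in [m-1]$ as the term associated with $k = 0$ is $0$. ∎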
Note that the theorem bounds the difference in values under shifts in distribution holding the potential $U$ fixed. Often in applications, we will take the potential function to depend on the underlying data distribution. For instance, we may take $U$ to be a measure of population accuracy, e.g. one defined through the expected loss $\mathbb{E}_{z \sim \mathcal{D}}[\ell_S(z)]$, where $\ell_S(z)$ is the loss on a point $z$ achieved by a model trained on the data set $S$. In the case where we only have access to samples from $\mathcal{D}_s$, we still may want to guarantee that the values computed under $\mathcal{D}_s$ and under $\mathcal{D}_t$, each with its own population potential, are close. Thankfully, such a result follows by showing that the potential defined with respect to $\mathcal{D}_s$ is close to the potential defined with respect to $\mathcal{D}_t$, and another application of the triangle inequality. For instance, when the potential is based on the population loss for a Lipschitz loss function, we can bound the difference in the potentials, again, in terms of the Wasserstein distance.
Here, we describe an estimation procedure, $\mathcal{D}$-Shapley, for computing distributional Shapley values. To begin, we assume that we can actually sample from the underlying $\mathcal{D}$. Then, in Section 3.2, we propose techniques to speed up the estimation and look into the practical issues of obtaining samples from the distribution. The result of these considerations is a practically motivated variant of the estimation procedure, Fast-$\mathcal{D}$-Shapley. In Section 3.3, we investigate how these optimizations perform empirically; we show that the strategies provide a way to smoothly trade off the precision of the valuation for computational cost.
The formulation from Theorem 2.1 suggests a natural algorithm for estimating the distributional Shapley values of a set of points. In particular, the distributional Shapley value $\nu(z; U, \mathcal{D}, m)$ is the expectation of the marginal contribution of $z$ to $S$ on $U$, where $S$ is drawn from a specific distribution over data sets. Thus, the change in performance when we add a point $z$ to a data set $S$ drawn from the correct distribution will be an unbiased estimate of the distributional Shapley value. Consider Algorithm 1, $\mathcal{D}$-Shapley, which given a subset $Z_0 \subseteq \mathcal{Z}$ of data, maintains for each $z \in Z_0$ a running average of $U(S \cup \{z\}) - U(S)$ over randomly drawn $S$.
In each iteration, Algorithm 1 uses a fixed sample $S$ to estimate the marginal contribution to $U$ for each $z \in Z_0$. This reuse correlates the estimation errors between points in $Z_0$, but provides computational savings. Recall that each evaluation of $U$ requires training an ML model using the points in $S$; thus, using the same $S$ for each $z \in Z_0$ reduces the number of models to be trained per iteration, since $U(S)$ is evaluated once rather than once per point. In cases where $U(S \cup \{z\})$ can be derived efficiently from $U(S)$, the savings may be even more dramatic; for instance, given a machine-learned model trained on $S$, it may be significantly cheaper to derive a model trained on $S \cup \{z\}$ than to retrain from scratch [GGVZ19].
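The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's Algorithm 1: the function name `d_shapley`, the callback `sample_dataset(n, rng)` for drawing `n` i.i.d. points, and the fixed iteration count in place of a convergence test are all hypothetical choices made for the example.

```python
import random


def d_shapley(points, sample_dataset, potential, m, iterations, seed=0):
    """Monte Carlo estimates of distributional Shapley values for `points`."""
    rng = random.Random(seed)
    estimates = [0.0] * len(points)
    for t in range(1, iterations + 1):
        k = rng.randint(1, m)            # k drawn uniformly from {1, ..., m}
        s = sample_dataset(k - 1, rng)   # S ~ D^(k-1), shared by every point
        base = potential(s)              # U(S), evaluated once per iteration
        for i, z in enumerate(points):
            gain = potential(s + [z]) - base
            estimates[i] += (gain - estimates[i]) / t  # running average
    return estimates


def sample_uniform(n, rng):
    # Hypothetical data distribution: uniform on [0, 1).
    return [rng.random() for _ in range(n)]


def size_potential(points):
    # Hypothetical potential that only depends on how much data there is.
    return min(len(points), 5) / 5.0


estimates = d_shapley([0.1, 0.9], sample_uniform, size_potential, m=5, iterations=50)
```

With this toy potential every marginal gain equals 1/5, so both estimates are 0.2 regardless of the random draws. In a realistic setting `potential` would train and evaluate a model, and the single shared evaluation of `potential(s)` is where the reuse described above saves work.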
The running time of Algorithm 1 can naively be upper bounded by the product of the number of iterations before termination $T$, the cardinality $|Z_0|$ of the points to valuate, and the expected time to evaluate $U$ on data sets of size $k \sim [m]$. We analyze the iteration complexity necessary to achieve $\varepsilon$-approximations of $\nu(z; U, \mathcal{D}, m)$ for each $z \in Z_0$.
Fixing a potential $U$ and distribution $\mathcal{D}$, and $Z_0 \subseteq \mathcal{Z}$, suppose $T \ge \Omega\bigl( \log(|Z_0| / \delta) / \varepsilon^2 \bigr)$. Algorithm 1 produces unbiased estimates and, with probability at least $1 - \delta$, the estimate for each point is within $\varepsilon$ of $\nu(z; U, \mathcal{D}, m)$ for all $z \in Z_0$.
When interpreting this (and future) formal approximation guarantees, it is important to note that we take $\varepsilon$ to be an absolute additive error. Recall, however, that $\nu(z; U, \mathcal{D}, m)$ is normalized by $m$; thus, as we take $m$ larger, the relative error incurred by a fixed $\varepsilon$ error grows. In this sense, $\varepsilon$ should typically scale inversely with $m$.
The claim follows by proving uniform convergence of the estimates for each $z \in Z_0$. Importantly, while the samples in each iteration are correlated across $z \in Z_0$, fixing $z$, the samples are independent across iterations. We include a formal analysis in Appendix D.
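To get a feel for the iteration counts such a guarantee implies, the sketch below instantiates a standard Hoeffding bound combined with a union bound over the points. It assumes the potential takes values in [0, 1], so each marginal gain lies in [-1, 1]; the constants are those of this textbook argument and are not taken from the paper's Appendix D, and both function names are illustrative.

```python
import math


def iterations_for_accuracy(epsilon, delta, num_points):
    """Iterations so that all estimates are epsilon-accurate w.p. >= 1 - delta.

    Assumes marginal gains lie in [-1, 1] (range 2) and independent iterations.
    """
    return math.ceil(2.0 * math.log(2.0 * num_points / delta) / epsilon ** 2)


def failure_bound(iterations, epsilon, num_points):
    # Hoeffding tail for one point, multiplied by the number of points.
    return 2.0 * num_points * math.exp(-iterations * epsilon ** 2 / 2.0)


needed = iterations_for_accuracy(epsilon=0.1, delta=0.05, num_points=100)
```

The count grows only logarithmically in the number of points being valued but quadratically in 1/epsilon, which is why the choice of additive error matters so much once values shrink with the data set size.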
Next, we propose two principled ways to speed up the baseline estimation algorithm. Under stability assumptions, the strategies maintain strong formal guarantees on the quality of the learned valuation. We also develop some guiding theory addressing practical issues that arise from the need to sample from $\mathcal{D}$. Somewhat counterintuitively, we argue that given only a fixed finite data set $B \sim \mathcal{D}^M$, we can still estimate values $\nu(z; U, \mathcal{D}, m)$ to high accuracy, for $M$ that grows modestly with $m$.
Theorem 2.2 shows that for sufficiently stable potentials $U$, similar points have similar distributional Shapley values. This property of distributional Shapley values is not only useful for inferring the values of points that were not in our original data set, but also suggests an approach for speeding up the computations of values for a fixed $Z_0$. In particular, to estimate the values for $z \in Z_0$ (with respect to a sufficiently Lipschitz-stable potential $U$) to $O(\varepsilon)$-precision, it suffices to estimate the values for an $\varepsilon$-cover of $Z_0$, and interpolate (e.g. via nearest neighbor search). Standard arguments show that random sampling is an effective way to construct an $\varepsilon$-cover [HP11].
As our first optimization, in Algorithm 2, we reduce the number of points to valuate through subsampling. Given a data set $Z_0$ to valuate, we first choose a random subset $Z_p \subseteq Z_0$ (where each $z \in Z_0$ is subsampled into $Z_p$ i.i.d. with some probability $p$); then, we run our estimation procedure on the points in $Z_p$; finally, we train a regression model on (point, estimated value) pairs from $Z_p$ to predict the values of the points from $Z_0 \setminus Z_p$. By varying the choice of $p$, we can trade off running time for quality of estimation: $p = 1$ recovers the original $\mathcal{D}$-Shapley scheme, whereas $p \approx 0$ will be very fast but likely produce noisy valuations.
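A minimal sketch of the subsample-then-interpolate idea follows, using a one-nearest-neighbor rule under the Euclidean metric as the regression model. The function names, the choice of metric, and the use of nearest neighbor rather than some other regressor are assumptions made for illustration; the estimated values passed in would come from whatever estimation procedure was run on the subsample.

```python
import math
import random


def subsample(num_points, p, rng):
    """Indices kept when each point is retained independently with probability p."""
    return [i for i in range(num_points) if rng.random() < p]


def nearest_neighbor_values(points, valued_indices, valued_estimates):
    """Give every point the estimate of its nearest valued point.

    `valued_indices` must be non-empty; valued points map to their own estimate.
    """
    out = []
    for z in points:
        best = min(
            range(len(valued_indices)),
            key=lambda j: math.dist(z, points[valued_indices[j]]),
        )
        out.append(valued_estimates[best])
    return out


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
filled = nearest_neighbor_values(points, [0, 2], [0.3, 0.7])
kept = subsample(len(points), p=0.5, rng=random.Random(0))
```

Only the two valued points required running the expensive estimator; the remaining two inherit values from their closest valued neighbor, which is justified to the extent that the potential is Lipschitz stable in the chosen metric.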
To understand the running time of Algorithm 1 further, we denote the time to evaluate $U$ on a set of cardinality $k$ by $R(k)$. (Footnote 5: We assume that the running time to evaluate $U(S)$ is a function of the cardinality of $S$, and not other auxiliary parameters.) As such, we can express the asymptotic expected running time as $O\bigl( T \cdot |Z_0| \cdot \mathbb{E}_{k \sim [m]}[R(k)] \bigr)$. Note that when $U(S)$ corresponds to the accuracy of a model trained on $S$, the complexity of evaluating $U(S)$ may grow significantly with $|S|$. At the same time, as the data set size grows, the marginal effect of adding $z$ to the training set tends to decrease; thus, we should need fewer large samples to accurately estimate the marginal effects. Taken together, intuitively, biasing the sampling of $k$ towards smaller training sets could result in a faster estimation procedure with similar approximation guarantees.
Concretely, rather than sampling $k \sim [m]$ uniformly, we can importance sample each $k$ proportional to some non-uniform weights $\{w_k\}$, where the weights decrease for larger $k$. More formally, we weight the draw of $k$ based on the stability of $U$. Algorithm 2 takes as input a set of importance weights $w = \{w_k\}$ and samples $k$ proportionally; without loss of generality, we assume $\sum_k w_k = 1$ and let $k \sim [m]_w$ denote a sample drawn such that $\Pr[k] = w_k$. We show that for the right choice of weights $w$, sampling $k \sim [m]_w$ improves the overall running time, while maintaining $\varepsilon$-accurate unbiased estimates of the values $\nu(z; U, \mathcal{D}, m)$.
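One way to keep the estimate unbiased under non-uniform sampling is the standard importance-sampling correction: a gain observed at size $k$ is divided by $m \cdot w_k$, the ratio of the sampling probability to the uniform probability $1/m$. The sketch below assumes this correction and weights proportional to $1/k$; neither detail is confirmed by the text above, and the function names are illustrative.

```python
import random


def harmonic_weights(m):
    """Weights proportional to 1/k for k = 1, ..., m, normalised to sum to 1."""
    raw = [1.0 / k for k in range(1, m + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_estimate(z, sample_dataset, potential, m, weights, iterations, seed=0):
    """Importance-sampled estimate of the distributional Shapley value of z."""
    rng = random.Random(seed)
    sizes = list(range(1, m + 1))
    total = 0.0
    for _ in range(iterations):
        k = rng.choices(sizes, weights=weights)[0]  # Pr[k] = w_k
        s = sample_dataset(k - 1, rng)
        gain = potential(s + [z]) - potential(s)
        total += gain / (m * weights[k - 1])        # undo the sampling bias
    return total / iterations


weights = harmonic_weights(4)
estimate = weighted_estimate(
    0.5,
    lambda n, rng: [rng.random() for _ in range(n)],  # hypothetical sampler
    lambda s: min(len(s), 4) / 4.0,                   # hypothetical potential
    m=4,
    weights=weights,
    iterations=20000,
)
```

Small sets are drawn more often, so most iterations are cheap, while the reweighting keeps the expectation equal to the uniform average over $k$. The price is higher variance on the rarely sampled large sizes, which is why the choice of weights has to be tied to how quickly marginal gains shrink.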
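[Informal] Suppose $U$ is deletion stable at an appropriate rate and can be evaluated on sets of cardinality $k$ in time $R(k)$. For a subsampling probability $p$ and a suitable choice of weights $w = \{w_k\}$, Algorithm 2 produces estimates that, with probability at least $1 - \delta$, are $\varepsilon$-accurate for all $z \in Z_p$, with an explicit bound on the expected running time.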
To interpret this result, note that if the subsampling probability $p$ is large enough that $Z_p$ will $\varepsilon$-cover $Z_0$, then using a nearest-neighbor predictor as the regression model will produce $O(\varepsilon)$-accurate estimates for all $z \in Z_0$. Further, if we imagine $\varepsilon = \Theta(1/m)$, then the computational cost grows as the time it takes to train a model on $m$ points scaled by a factor logarithmic in $|Z_0|$ and the failure probability. In fact, Theorem 3.2 is a special case of a more general theorem that provides a recipe for devising an appropriate sampling scheme based on the stability of the potential $U$. In particular, the general theorem (stated and proved in Appendix D) shows that the more stable the potential, the more we can bias sampling in favor of smaller sample sizes.
Estimating distributional Shapley values requires samples from the distribution $\mathcal{D}$. In practice, we often want to evaluate the values with respect to a distribution for which we only have some database $B \sim \mathcal{D}^M$ for some large (but finite) $M$. In such a setting, we need to be careful; indeed, avoiding artifacts from a single draw of data is the principal motivation for introducing the distributional Shapley framework. In fact, the analysis of Theorem 3.2 also reveals an upper bound on how big the database should be in order to obtain accurate estimates with respect to $\mathcal{D}$. As a concrete case, if $U$ is sufficiently deletion stable and we take $\varepsilon = \Theta(1/m)$ error, then the required database size $M$ admits an explicit bound.
In other words, for a sufficiently stable potential $U$, the data complexity grows modestly with $m$. Note that, again, this bound leverages the fact that in every iteration, we reuse the same sample $S$ for each $z \in Z_0$. See Appendix D for a more detailed analysis.
In practice, we find that sampling subsets of data from the database with replacement works well; we describe the full procedure in Algorithm 2, where we denote an i.i.d. sample of $k$ points drawn uniformly from the database as $S \sim B^k$. Finally, we note that ideally, $m$ should be close to the size of the training sets that model developers intend to use; in practice, these data set sizes may vary widely. One appealing aspect of both $\mathcal{D}$-Shapley algorithms is that when we estimate values with respect to $m$, the samples we obtain also allow us to simultaneously estimate $\nu(z; U, \mathcal{D}, m')$ for any $m' \le m$. Indeed, we can simply truncate our estimates to only include samples corresponding to $S$ with $|S| < m'$.
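The truncation trick can be sketched in a few lines, assuming the estimator logged one `(k, gain)` record per iteration, where `k` is the sampled set size including the point being valued (so the random set has `k - 1` points) and `k` was drawn uniformly. The record format and function name are assumptions for illustration. Under uniform sampling, conditioning on `k <= m_prime` leaves `k` uniform on the smaller range, so a plain average of the retained gains estimates the value for the smaller size.

```python
def truncated_estimate(records, m_prime):
    """Estimate the value at data set size m_prime from logged (k, gain) pairs."""
    kept = [gain for k, gain in records if k <= m_prime]
    if not kept:
        raise ValueError("no logged samples with k <= m_prime")
    return sum(kept) / len(kept)


# Hypothetical log from a run with m = 3.
records = [(1, 0.5), (2, 0.3), (3, 0.1), (1, 0.7)]
value_at_2 = truncated_estimate(records, 2)
value_at_3 = truncated_estimate(records, 3)
```

If the sizes were importance sampled rather than drawn uniformly, the retained gains would additionally need to be reweighted for the smaller range; this sketch does not handle that case.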
We investigate the empirical effectiveness of the distributional Shapley framework by running experiments in three settings on large real-world data sets. The first setting uses the UK Biobank data set, containing the genotypic and phenotypic data of individuals in the UK [SGA15]; we evaluate a task of predicting whether the patient will be diagnosed with breast cancer using 120 features. Overall, our data has 10K patients (5K diagnosed positively); we use 9K patients as our database ($B$), and take classification accuracy on a hold-out set of 500 patients as the performance metric ($U$). The second data set is Adult Income, where the task is to predict whether income exceeds \$50K/yr given 14 personal features [DG17]. With 50K individuals total, we use 40K as our database, and classification accuracy on 5K individuals as our performance metric. In these two experiments, we take the maximum data set size K and K, respectively.
For both settings, we first run $\mathcal{D}$-Shapley without optimizations as a baseline. As a point of comparison, in these settings the computational cost of this baseline is on the same order as running the TMC-Shapley algorithm of [GZ19], which computes the data Shapley values for each $z$ in the data set $B$. Given this baseline, we evaluate the effectiveness of the proposed optimizations, using weighted sampling and interpolation (separately), for various levels of computational savings. In particular, we vary the sampling weights $\{w_k\}$ and subsampling probability $p$ to vary the computational cost (where weighting towards smaller $k$ and taking smaller $p$ each yield more computational savings). All algorithms are terminated once the average absolute change in value over recent iterations falls below a fixed threshold.
To evaluate the quality of the distributional Shapley estimates, we perform a point removal experiment, as proposed by [GZ19], where given a training set, we iteratively remove points, retrain the model, and observe how the performance changes. In particular, we remove points from most to least valuable (according to our estimates), and compare to the baseline of removing random points. Intuitively, removing high value data points should result in a more significant drop in the model’s performance. We report the results of this point removal experiment using the values determined using the baseline Algorithm 1, as well as at various speed-up factors (measured as computational cost relative to the baseline).
As Figure 1 demonstrates, when training a logistic regression model, removing the points with high distributional Shapley value causes a sharp decrease in accuracy on both tasks, even when using the most aggressive weighted sampling and interpolation optimizations. Appendix E reports the results for various other models. As a finer point of investigation, we report the correlation between the estimated values without optimizations and with various levels of computational savings, for a handful of prediction models. Figure 2 plots the curves and shows that the optimizations provide a smooth interpolation between computational cost and recovery, across every model type. It is especially interesting that these trade-offs are consistently smooth across a variety of models using the 0-1 loss, which do not necessarily induce a potential with formal guarantees of stability.
In our final setting, we push the limits of what types of data can be valuated. Specifically, by combining both weighted sampling and interpolation (resulting in a speed-up), we estimate the values of K images from the CIFAR10 data set; valuating this data set would be prohibitively expensive using prior Shapley-based techniques. In particular, to obtain accurate estimates for each point, TMC-Shapley would require an unreasonably large number of Monte Carlo iterations due to the sheer size of the database to valuate. We valuate points based on an image classification task and demonstrate in Appendix E that the estimates identify highly valuable points.
Next, we consider a natural setting where a data broker wishes to sell data to various buyers. Each buyer could already own some private data. In particular, suppose the broker plans to sell the set $Z_0$ and a buyer holds a private data set $B$; in this case, the relevant values are the data Shapley values $\phi(z; U, B \cup Z_0)$ for each $z \in Z_0$. Within the original data Shapley framework, computing these values requires a single party to hold both $B$ and $Z_0$. For a multitude of financial and legal reasons, neither party may be willing to send their data to the other before agreeing to the purchase. Such a scenario represents a fundamental limitation of the non-distributional Shapley framework that seemed to jeopardize its practical viability. We argue that the distributional Shapley framework largely resolves this particular issue: without exchanging data up front, the broker simply estimates the values $\nu(z; U, \mathcal{D}, m)$; in expectation, these values will accurately reflect the value to a buyer with a private data set drawn from a distribution close to $\mathcal{D}$.
We report the results of this case study on four different large data sets in Figure 3; details are included in Appendix F. For each data set, a set of buyers each holds a small data set $B$, and the broker sells them a data set $Z_0$ of the same size; the buyers then valuate the points in $Z_0$ by running the TMC-Shapley algorithm of [GZ19] on $B \cup Z_0$. In Figure 3(a), we show that the rank correlation between the broker’s distributional estimates and the buyer’s observed values is generally high. Even when the rank correlation is somewhat lower, the broker and buyer agree on the value of the set as a whole. Specifically, we observe that the seller’s estimates are approximately unbiased and that the absolute percentage error is low.
In Figure 3(b), we show the results of a point addition experiment for the Diabetes130 data set. Here, we consider the effect of adding the broker's points to the buyer's data under three different orderings: according to the broker's estimates, according to the buyer's estimates, and under a random ordering. We observe that the performance (classification accuracy) gains from adding the points in the broker's order and in the buyer's order track one another well; after all of the points have been added, the resulting models achieve essentially the same performance, and both orderings considerably outperform the random ordering along the way. We report results for the other data sets in Appendix F.
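The two comparisons used in this case study, rank agreement between two sets of values and point addition under a given ordering, can be sketched as follows. This is a minimal illustration under stated assumptions: points are added from highest to lowest value, the rank correlation is Spearman's (with no ties), and the performance function, data, and value lists are made-up placeholders rather than anything from the experiments.

```python
def spearman_rank_correlation(xs, ys):
    """Spearman rank correlation; assumes neither list contains ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result

    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def point_addition_curve(base_points, new_points, values, performance):
    """Add new_points to base_points in decreasing order of `values`.

    Returns the performance before any addition and after each one.
    """
    order = sorted(range(len(new_points)), key=lambda i: values[i], reverse=True)
    current = list(base_points)
    curve = [performance(current)]
    for index in order:
        current.append(new_points[index])
        curve.append(performance(current))
    return curve


def toy_performance(points):
    """Hypothetical stand-in for model accuracy: closeness of the mean to 0."""
    mean = sum(points) / len(points)
    return -(mean ** 2)


buyer_data = [2.0, 3.0]                  # hypothetical private data
broker_data = [-4.0, -1.0, 0.5, 6.0]     # hypothetical points for sale
broker_values = [0.9, 0.6, 0.2, -0.5]    # made-up broker estimates
buyer_values = [0.8, 0.7, 0.1, -0.4]     # made-up buyer estimates

rho = spearman_rank_correlation(broker_values, buyer_values)
broker_curve = point_addition_curve(buyer_data, broker_data, broker_values, toy_performance)
buyer_curve = point_addition_curve(buyer_data, broker_data, buyer_values, toy_performance)
```

When the two valuations induce the same ordering, as in this made-up example, the two addition curves coincide; in general they differ at intermediate steps and agree only once every point has been added.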
The present work makes significant progress on understanding the statistical aspects of determining the value of data. In particular, by reformulating the data Shapley value as a distributional quantity, we obtain a valuation function that does not depend on a fixed data set; reducing the dependence on the specific draw of data eliminates inconsistencies in valuation that can arise due to sampling artifacts. Further, we demonstrate that the distributional Shapley framework provides an avenue to valuate data across a wide variety of tasks, providing stronger theoretical guarantees and orders of magnitude speed-ups over prior estimation schemes. In particular, the stability results that we prove for distributional Shapley (Theorems 2.2 and 1) are not generally true for the original data Shapley due to its dependence on a fixed data set.
One outstanding limitation of the present work is the reliance on a known task, algorithm, and performance metric (i.e. taking the potential to be fixed). We propose reducing the dependence on these assumptions as a direction for future investigations; indeed, very recent work has started to chip away at the assumption that the learning algorithm is fixed in advance [YGZ19].
The distributional Shapley perspective also raises the thought-provoking research question of whether we can valuate data while protecting the privacy of individuals who contribute their data. One severe limitation of the data Shapley framework is that the value of every point depends nontrivially on every other point in the data set. In a sense, this makes the data Shapley value an inherently non-private value: the estimate of the value of one point reveals information about the other points in the data set. By marginalizing the dependence on the data set, the distributional Shapley framework opens the door to estimating data valuations while satisfying strong notions of privacy, such as differential privacy [DMNS06]. Such an estimation scheme could serve as a powerful tool amidst increasing calls to ensure the privacy of individuals and to compensate them for their personal data [LNG19].
Lecture notes on Statistical Learning Theory, 2011.
The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2019.
International Journal of Computer Vision, 115(3):211–252, 2015.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

Here, we provide a high-level review of the axioms that Shapley used to describe an equitable valuation function [Sha53]. We consider the data Shapley setting, letting $\phi(z; U, B)$ denote the value of a point $z \in B$ for a finite subset $B \subseteq \mathcal{Z}$ with respect to potential $U$.
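Symmetry – Consider $z, z' \in B$; suppose for all $S \subseteq B \setminus \{z, z'\}$, $U(S \cup \{z\}) = U(S \cup \{z'\})$. Then, $\phi(z; U, B) = \phi(z'; U, B)$.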
That is, if two data points are equivalent, then they should receive the same value.
Null player – Consider $z \in B$; suppose for all $S \subseteq B \setminus \{z\}$, $U(S \cup \{z\}) = U(S)$. Then, $\phi(z; U, B) = 0$.
That is, if a data point contributes no marginal gain in potential to any nontrivial subset, then it receives no value.
Additivity – Consider two potentials $U_1, U_2$. For all $z \in B$, $\phi(z; U_1 + U_2, B) = \phi(z; U_1, B) + \phi(z; U_2, B)$.
That is, the value of a data point with respect to the combination of two tasks (addition of two potentials) is the sum of the values with respect to each task (potential) separately.
[Sha53] The Shapley value is the unique valuation function that satisfies the symmetry, null player, and additivity axioms. Additionally, the Shapley value satisfies the desirable property that it allocates all of the value to the contributors.
Efficiency – The sum of the individuals’ Shapley values equals the value of the coalition.
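For concreteness, the Shapley value and the efficiency property can be written out as below. The notation is an assumption made for this sketch: $B$ is a finite data set with $n = |B|$ points, $U$ is the potential, and $\phi(z; U, B)$ is the value of $z \in B$. The normalization by $1/n$ is the classical one; other presentations may scale the value by a different constant.

```latex
% Assumed notation: B is a finite data set with n = |B|, U is the potential.
% Shapley value: average marginal contribution of z over subset sizes.
\phi(z; U, B) = \frac{1}{n} \sum_{S \subseteq B \setminus \{z\}}
    \binom{n-1}{|S|}^{-1} \bigl( U(S \cup \{z\}) - U(S) \bigr)

% Efficiency: the values sum to the potential of the full data set.
\sum_{z \in B} \phi(z; U, B) = U(B) - U(\emptyset)
```

When $U(\emptyset) = 0$, the right-hand side of the second line is simply $U(B)$, the value of the whole coalition.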
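It is straightforward to verify that the distributional Shapley value immediately inherits the properties of symmetry, null player, and additivity (by linearity of expectation). Further, it satisfies an on-average variant of efficiency. Given a potential $U$ and a data distribution $\mathcal{D}$, for $m \in \mathbb{N}$,
$$\mathbb{E}_{z \sim \mathcal{D}}\left[\nu(z; U, \mathcal{D}, m)\right] = \frac{\mathbb{E}_{S \sim \mathcal{D}^m}\left[U(S)\right]}{m}.$$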
We expand the expected distributional Shapley value with its definition and then apply linearity of expectation.
∎
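The steps of this argument can be spelled out as follows. This is a sketch under assumptions: it takes the distributional Shapley value $\nu(z; U, \mathcal{D}, m)$ to be the expected data Shapley value $\phi$ of $z$ over data sets of size $m$ containing $z$, with the remaining $m - 1$ points drawn i.i.d. from $\mathcal{D}$, and it takes $U(\emptyset) = 0$. The symbols $\nu$, $\phi$, and $z_i$ are notation chosen for this sketch.

```latex
\begin{align*}
\mathbb{E}_{z \sim \mathcal{D}}\bigl[\nu(z; U, \mathcal{D}, m)\bigr]
  &= \mathbb{E}_{z \sim \mathcal{D}}\,
     \mathbb{E}_{B \sim \mathcal{D}^{m-1}}
     \bigl[\phi(z; U, B \cup \{z\})\bigr]
     && \text{(definition of } \nu\text{)} \\
  &= \frac{1}{m} \sum_{i=1}^{m}
     \mathbb{E}_{S = (z_1, \dots, z_m) \sim \mathcal{D}^{m}}
     \bigl[\phi(z_i; U, S)\bigr]
     && \text{(the } z_i \text{ are exchangeable)} \\
  &= \frac{1}{m}\,
     \mathbb{E}_{S \sim \mathcal{D}^{m}}
     \Bigl[\, \sum_{i=1}^{m} \phi(z_i; U, S) \Bigr]
     && \text{(linearity of expectation)} \\
  &= \frac{1}{m}\,
     \mathbb{E}_{S \sim \mathcal{D}^{m}}\bigl[U(S)\bigr]
     && \text{(efficiency, } U(\emptyset) = 0\text{)}
\end{align*}
```

The only property of the data Shapley value used here is efficiency on each fixed data set $S$; the averaging over draws of $S$ is what makes the resulting statement hold on average rather than for every data set.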
[Restatement of Proposition 2.4] Suppose has bounded second moments. Then for and , for mean estimation over is given by
for an explicit constant