1 Introduction
Data valuation
addresses the question of how to decide the worth of data. Data analysis based on machine learning (ML) has enabled various applications, such as targeted advertisement, autonomous driving, and healthcare, and creates tremendous business value; at the same time, there is no principled way to attribute this value to different data sources. Thus, the problem of data valuation has recently attracted increasing attention in the research community. In this work, we focus on valuing data in relative terms when the data is used for supervised learning.
The Shapley value has been proposed to value data in recent works (Jia et al., 2019b, a; Ghorbani and Zou, 2019). The Shapley value originates from cooperative game theory and is considered a classic way of distributing total gains generated by the coalition of a set of players. One can formulate supervised learning as a cooperative game between different training points and thus apply the Shapley value to data valuation. An important reason for employing the Shapley value is that it uniquely possesses a set of appealing properties desired by a data value notion, such as fairness and additivity of values in multiple data uses.
Algorithmically, the Shapley value inspects the marginal contribution of a point to every possible subset of training data and averages the marginal contributions over all subsets. Hence, computing the Shapley value is very expensive, and the exact calculation has exponential complexity. Existing works on Shapley value-based data valuation have focused on how to scale up the Shapley value calculation to large training data sizes. The state-of-the-art methods to estimate the Shapley value for general models are based on Monte Carlo approximation
(Jia et al., 2019b; Ghorbani and Zou, 2019). However, these methods require retraining the ML model a large number of times; thus, they are only applicable to simple models and small training data sizes. The first question we ask in this paper is: How can we value data when it comes to large models, such as neural networks, and massive datasets?
In this paper, we propose a simple and efficient heuristic for data valuation based on the Shapley value for any given classification model. Our algorithm obviates the need to retrain the model multiple times. Its complexity scales quasilinearly with the training data size, linearly with the validation data size, and is independent of the model size. The key idea underlying our algorithm is to approximate the model by a nearest neighbor classifier, whose locality structure can reduce the number of subsets examined for computing the Shapley value, thus enabling tremendous efficiency improvements.
Moreover, the existing works argue for the use of the Shapley value mainly by interpreting its properties (e.g., fairness, additivity) in the data valuation context. However, can we move beyond these known properties and reason about the “performance” of the Shapley value as a data value measure? In this paper, we formalize two performance metrics specific to data values: one focuses on the predictive power of a data value measure, studying whether it is indicative of a training point’s contribution to some random set; the other focuses on the ability of a data value to discriminate “good” training points from “bad” ones for privacy-preserving models. We consider the leave-one-out error as a simple baseline data value measure and investigate the advantage of the Shapley value in terms of the two performance metrics above.
Finally, we show that our algorithm is a versatile and scalable tool that can be applied to a wide range of tasks: correcting label noise, detecting watermarks used for claiming the ownership of a data source, summarizing a large dataset, guiding the acquisition of new data, and domain adaptation. On small datasets and models where the complexity of the existing algorithms is acceptable, we demonstrate that our approximation achieves at least comparable performance on these tasks. We also experiment on large datasets and models, in which case prior algorithms all fail to value data in a reasonable amount of time, and highlight the scalability of our algorithm.
2 General Frameworks for Data Valuation
We present two frameworks to interpret the value of each training point for supervised learning and discuss their computational challenges.
We first set up the notation to characterize the main ingredients of a supervised learning problem, including the training and validation data, the learning algorithm, and the performance measure. Let $D = \{z_1, \ldots, z_N\}$ be the training set, where each $z_i$ is a feature-label pair $(x_i, y_i)$, and let $D_{\text{val}}$ be the validation data. Let $\mathcal{A}$ be the learning algorithm, which maps a training dataset to a model. Let $U$ be a performance measure which takes as input training data, a learning algorithm, and validation data and returns a score. We write $U(S, \mathcal{A}, D_{\text{val}})$ to denote the performance score of the model trained on a subset $S \subseteq D$ of training data using the learning algorithm $\mathcal{A}$ when tested on $D_{\text{val}}$. When the learning algorithm and validation data are self-evident, we will suppress the dependence of $U$ on them and just use $U(S)$ for short. Our goal is to assign a score to each training point $z_i$, denoted by $\nu(z_i; D, \mathcal{A}, U)$, indicating its value for the supervised learning problem specified by $D$, $\mathcal{A}$, and $U$. We will often write it as $\nu(z_i)$ or $\nu_i$ to simplify notation.
2.1 Leave-One-Out Method
One simple way of appraising each training point is to measure its contribution to the rest of the training data:
$\nu_{loo}(z_i) = U(D) - U(D \setminus \{z_i\})$  (1)
This value measure is referred to as the leave-one-out (LOO) value. The exact evaluation of the LOO values for $N$ training points requires retraining the model $N$ times, and the associated computational cost is prohibitive for large training datasets and large models. For deep neural networks, Koh and Liang (2017)
proposed to estimate the model performance change due to the removal of each training point via influence functions. However, in order to obtain the influence functions, one needs to evaluate the inverse Hessian of the loss function. With $N$ training points and $p$ model parameters, this requires $O(Np^2 + p^3)$ operations. Koh and Liang (2017) introduced a method to approximate the influence function with $O(Np)$ complexity, which is still expensive for large networks. In contrast, our method, which will be discussed in Section 3, has complexity independent of model size and is thus preferable for models like deep neural networks.
2.2 Shapley Value-based Method
The Shapley value is a classic concept in cooperative game theory to distribute the total gains generated by the coalition of all players. One can think of a supervised learning problem as a cooperative game among training data instances and apply the Shapley value to value the contribution of each training point.
Given a performance measure $U$, the Shapley value of a training point $z_i$ is defined as the average marginal contribution of $z_i$ to all possible subsets $S \subseteq D \setminus \{z_i\}$ formed by other training points:
$\nu_{shap}(z_i) = \frac{1}{N} \sum_{S \subseteq D \setminus \{z_i\}} \frac{1}{\binom{N-1}{|S|}} \left[ U(S \cup \{z_i\}) - U(S) \right]$  (2)
The reason why the Shapley value is appealing for data valuation is that it uniquely satisfies the following properties.


Group rationality: The utility of the machine learning model is completely distributed among all training points, i.e., $U(D) = \sum_{i=1}^{N} \nu(z_i)$. This is a natural rationality requirement because any rational group of data contributors would expect to distribute the full yield of their coalition.

Fairness: (1) Two data points which have identical contributions to the model utility should have the same value. That is, if for data $z_i$ and $z_j$ and any subset $S \subseteq D \setminus \{z_i, z_j\}$ we have $U(S \cup \{z_i\}) = U(S \cup \{z_j\})$, then $\nu(z_i) = \nu(z_j)$. (2) Data points with zero marginal contributions to all subsets of the training set should be given zero value, i.e., $\nu(z_i) = 0$ if $U(S \cup \{z_i\}) = U(S)$ for all $S \subseteq D \setminus \{z_i\}$.

Additivity: When the overall performance measure is the sum of separate performance measures, the overall value of a datum should be the sum of its values under each performance measure, i.e., $\nu(z_i; U_1 + U_2) = \nu(z_i; U_1) + \nu(z_i; U_2)$. In machine learning applications, the model performance measure is often evaluated by summing up the individual losses of validation points. We expect the value of a data point in predicting multiple validation points to be the sum of its values in predicting each validation point.
Despite the desirable properties of the Shapley value, calculating it is expensive. Evaluating the exact Shapley value involves computing the marginal contribution of each training point to all possible subsets, which requires $O(2^N)$ utility evaluations. Such complexity is clearly impractical for valuating a large number of training points. Even worse, for ML tasks, evaluating the utility function per se (e.g., testing accuracy) is computationally expensive, as it requires retraining an ML model.
Ghorbani and Zou (2019) introduced two approaches to approximating the Shapley value based on Monte Carlo approximation. The central idea behind these approaches is to treat the Shapley value of a training point as its expected contribution to a random subset and use the sample average to approximate the expectation. By the definition of the Shapley value, the random set has size $0$ to $N-1$ with equal probability (corresponding to the $1/N$ factor) and is equally likely to be any subset of a given size (corresponding to the $1/\binom{N-1}{|S|}$ factor). In practice, one can implement an equivalent sampler by drawing a random permutation of the training set. Then, the Shapley value can be estimated by computing the marginal contribution of a point to the points preceding it and averaging these marginal contributions across different permutations. However, these Monte Carlo-based approaches cannot circumvent the need to retrain models and are therefore not viable for large models. In our experiments, we found that the approaches in (Ghorbani and Zou, 2019) can manage data sizes up to one thousand for simple models such as logistic regression and shallow neural networks, while failing to estimate the Shapley value for larger data sizes and deep nets in a reasonable amount of time. We will evaluate runtime in more detail in Section 5.
3 Scalable Data Valuation via KNN Proxies
In this section, we explicate our proposed method to achieve efficient data valuation for large training data sizes and large models like deep nets. The key idea is to approximate the model with a KNN classifier, which enjoys efficient algorithms for computing both the LOO and the Shapley value due to its unique locality structure.
3.1 KNN Shapley Value
Given a single validation point $x_{\text{val}}$ with label $y_{\text{val}}$, the simplest, unweighted version of a $K$NN classifier first finds the top-$K$ training points $(x_{\alpha_1}, \ldots, x_{\alpha_K})$ that are most similar to $x_{\text{val}}$ and outputs the probability of $x_{\text{val}}$ taking the label $y_{\text{val}}$ as $P[x_{\text{val}} \to y_{\text{val}}] = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[y_{\alpha_k} = y_{\text{val}}]$. We assume that the confidence of predicting the right label is used as the performance measure, i.e.,
$U(S) = \frac{1}{K} \sum_{k=1}^{\min\{K, |S|\}} \mathbb{1}[y_{\alpha_k(S)} = y_{\text{val}}]$  (3)
where $\alpha_k(S)$ represents the index of the training feature that is the $k$th closest to $x_{\text{val}}$ among the training examples in $S$. Particularly, $\alpha_k(D)$ is abbreviated to $\alpha_k$. Under this performance measure, the Shapley value can be calculated exactly using the following theorem.
Theorem 1 (Jia et al. (2019a)).
Consider the model performance measure in (3). Then, the Shapley value of each training point can be calculated recursively as follows:
$s_{\alpha_N} = \frac{\mathbb{1}[y_{\alpha_N} = y_{\text{val}}]}{N}$  (4)
$s_{\alpha_i} = s_{\alpha_{i+1}} + \frac{\mathbb{1}[y_{\alpha_i} = y_{\text{val}}] - \mathbb{1}[y_{\alpha_{i+1}} = y_{\text{val}}]}{K} \frac{\min\{K, i\}}{i}$  (5)
Theorem 1 can be readily extended to the case of multiple validation points, wherein the utility function is defined as the average of the utility function with respect to each validation point. By the additivity property, the Shapley value with respect to multiple validation points is the average across the Shapley values with respect to every single validation point. We will call the values obtained from (4) and (5) the KNN Shapley value hereinafter. For each validation point, computing the KNN Shapley value requires only $O(N \log N)$ time, which circumvents the exponentially large number of utility evaluations entailed by the Shapley value definition. The intuition for achieving such an exponential improvement is that for KNN, the marginal contribution $U(S \cup \{z_i\}) - U(S)$ only depends on the relative distance of $x_i$ and the $K$ nearest neighbors in $S$ to the validation point. When calculating the Shapley value, instead of considering all $S \subseteq D \setminus \{z_i\}$, we only need to focus on the subsets that result in distinctive nearest neighbors.
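As a concrete illustration, the recursion in Theorem 1 can be sketched in a few lines of numpy; the function and variable names are ours, and the averaging over validation points uses the additivity property discussed above.

```python
import numpy as np

def knn_shapley(x_train, y_train, x_val, y_val, K):
    """Exact Shapley values under the unweighted KNN utility of Eq. (3),
    computed with the recursion of Theorem 1 and averaged over validation
    points via the additivity property."""
    N = len(x_train)
    values = np.zeros(N)
    for xv, yv in zip(x_val, y_val):
        # Rank training points by distance to this validation point.
        idx = np.argsort(np.linalg.norm(x_train - xv, axis=1))
        match = (y_train[idx] == yv).astype(float)  # 1[y_{alpha_i} == y_val]
        s = np.zeros(N)
        s[N - 1] = match[N - 1] / N                 # Eq. (4): farthest point
        for i in range(N - 2, -1, -1):              # Eq. (5): sweep inward
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
        values[idx] += s                            # map ranks back to indices
    return values / len(x_val)
```

A useful sanity check is group rationality: for a single validation point, the values sum to the KNN confidence $U(D)$ in the correct label.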
3.2 An Efficient Algorithm Enabled by the KNN Shapley Value
By leveraging the KNN Shapley value as a proxy, we propose the following algorithm to value the training data importance for general models that do not enjoy efficient Shapley value calculation methods. For deep nets, the algorithm proceeds as follows: (1) Given the training set, we train the network and obtain the deep features (i.e., the input to the last softmax layer); (2) We train a KNN classifier on the deep features and corresponding labels and further calibrate $K$ such that the resulting KNN mimics the performance of the original deep net; (3) With the choice of $K$ obtained from the last step, we employ Theorem 1 to compute the Shapley value of the deep features. For other models, the algorithm directly computes the KNN Shapley value on the raw data as a surrogate for the true Shapley value. The complexity of the above algorithm is $O(Nd + N \log N)$ per validation point, where $d$ is the dimension of the deep feature representation. As opposed to Monte Carlo-based methods (e.g., (Ghorbani and Zou, 2019; Jia et al., 2019b)), the proposed algorithm does not require retraining models. It is well-suited for approximating values for large models, as its complexity is independent of model size.
Note that when applied to deep nets, this algorithm neglects the contribution of data points to feature learning, as the feature extractor is fixed therein. In other words, the data values produced by this algorithm attempt to distribute the total yield of a cooperative game between deep features, rather than raw data. However, as we will show in the experiments, these values can still reflect data usefulness in various applications, while being much more efficient to compute than in the existing works.
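The calibration step (2) can be carried out by a simple sweep over candidate values of $K$, picking the one whose KNN validation accuracy on the deep features is closest to the original network's accuracy. The sketch below is our own minimal rendering of this step: the feature matrices stand in for the penultimate-layer outputs, and the candidate grid is an assumption, not prescribed by the paper.

```python
import numpy as np

def knn_val_accuracy(feat_train, y_train, feat_val, y_val, k):
    """Validation accuracy of an unweighted KNN classifier on deep features."""
    correct = 0
    for x, y in zip(feat_val, y_val):
        nn = np.argsort(np.linalg.norm(feat_train - x, axis=1))[:k]
        correct += int(np.bincount(y_train[nn]).argmax() == y)
    return correct / len(feat_val)

def calibrate_k(feat_train, y_train, feat_val, y_val, net_acc, grid=(1, 3, 5, 10, 25)):
    """Choose K so that the KNN surrogate best mimics the net's validation accuracy."""
    return min(grid, key=lambda k: abs(
        knn_val_accuracy(feat_train, y_train, feat_val, y_val, k) - net_acc))
```

The chosen $K$ is then passed to the KNN Shapley computation of Theorem 1 on the same features.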
4 Theoretical Comparison Between LOO and the Shapley Value
One may ask how we choose between the LOO method and the Shapley value for valuing training data in machine learning tasks. We have seen that the Shapley value uniquely satisfies several appealing properties. Moreover, prior work (Ghorbani and Zou, 2019) has demonstrated empirical evidence that the Shapley value is more effective than the LOO value for assessing the quality of training data instances. Nevertheless, can we theoretically justify the “valuation performance” of the two value measures?
4.1 Predictive Power of the Value Measures
To justify that the data values produced by a valuation technique reflect the usefulness of data in practice, existing valuation techniques are often examined in terms of their performance as a preprocessing step to filter out low-quality data, such as mislabeled or noisy data, from a given dataset. Then, one may train a model based only on the remaining “good” data instances or their combination with additional data. Note that both the LOO and the Shapley value only measure the worth of a data point relative to other points in the given dataset. Since it is still uncertain what data will be used in tandem with the point being valued after data valuation is performed, we hope that the value of a point is indicative of the expected performance boost when combining the point with a random set of data points.
In particular, we consider two points that have different values under a given value measure and study whether the expected model performance improvements due to the addition of these two points will have the same order as the estimated values. With the same order, we can confidently select the highervalue point in favor of another when performing ML tasks. We formalize this desirable property in the following definition.
Definition 1.
We say a value measure $\nu$ is order-preserving at a pair of training points $z_i, z_j$ that have different values if
$\left( \nu(z_i) - \nu(z_j) \right) \cdot \left( \mathbb{E}_S[U(S \cup \{z_i\})] - \mathbb{E}_S[U(S \cup \{z_j\})] \right) > 0$  (6)
where $S$ is an arbitrary random set drawn from some distribution.
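For a given utility and value measure, this definition can be checked empirically by replacing the two expectations with sample averages over random sets. The sketch below is our own illustration; the additive toy utility and set samples are assumptions chosen so the check is easy to follow.

```python
def order_preserving(values, utility, i, j, random_sets):
    """Empirical check of Definition 1: does the sign of values[i] - values[j]
    match the sign of E_S[U(S + {i})] - E_S[U(S + {j})]?"""
    gain_i = sum(utility(S | {i}) for S in random_sets) / len(random_sets)
    gain_j = sum(utility(S | {j}) for S in random_sets) / len(random_sets)
    return (values[i] - values[j]) * (gain_i - gain_j) > 0

# Toy additive utility: a set is worth the number of "useful" points it contains.
useful = {0, 2}
U = lambda S: len(S & useful)
sample_sets = [set(), {3}, {2, 3}]
print(order_preserving({0: 1.0, 1: 0.0}, U, 0, 1, sample_sets))  # True
```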
For general model performance measures $U$, it is difficult to analyze the order-preservingness of the corresponding value measures. However, for KNN, we can precisely characterize this property for both the LOO and the Shapley value. The formula for the KNN Shapley value is given in Theorem 1, and we present the expression for the KNN LOO value in the following lemma.
Lemma 1 (KNN LOO Value).
Consider the model performance measure in (3). Then, the KNN LOO value of each training point can be calculated as $\nu_{loo}(z_{\alpha_i}) = \frac{1}{K} \left( \mathbb{1}[y_{\alpha_i} = y_{\text{val}}] - \mathbb{1}[y_{\alpha_{K+1}} = y_{\text{val}}] \right)$ if $i \leq K$ and $\nu_{loo}(z_{\alpha_i}) = 0$ otherwise.
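In code, Lemma 1 amounts to the following for a single validation point; this is our own sketch (it assumes $N > K$), and naming is ours.

```python
import numpy as np

def knn_loo(x_train, y_train, x_val, y_val, K):
    """KNN LOO value per Lemma 1: only the K nearest neighbors of the
    validation point receive a nonzero value."""
    idx = np.argsort(np.linalg.norm(x_train - x_val, axis=1))
    match = (y_train[idx] == y_val).astype(float)
    v = np.zeros(len(x_train))
    # Removing a top-K neighbor promotes the (K+1)-th neighbor into the top K.
    v[idx[:K]] = (match[:K] - match[K]) / K
    return v
```

This can be checked against the brute-force LOO definition in (1) with the KNN utility of (3).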
Now, we are ready to state the theorem that exhibits the order-preservingness of the KNN LOO value and the KNN Shapley value.
Theorem 2.
For any given $D = \{z_1, \ldots, z_N\}$, where $z_i = (x_i, y_i)$, and any given validation point $z_{\text{val}} = (x_{\text{val}}, y_{\text{val}})$, assume that $z_1, \ldots, z_N$ are sorted according to their similarity to $x_{\text{val}}$. Let $d(\cdot, \cdot)$ be the feature distance metric according to which $D$ is sorted. Suppose that $P[d(x, x_{\text{val}}) \geq d(x_i, x_{\text{val}})] \geq \tau$ for all $i = 1, \ldots, N$ and some $\tau > 0$. Then, $\nu_{shap}$ is order-preserving for all pairs of points in $D$; $\nu_{loo}$ is order-preserving only for pairs $(z_i, z_j)$ such that $\max\{i, j\} \leq K$.
Due to the space limit, we defer all proofs to the appendix. The assumption in Theorem 2 intuitively means that it is possible to sample points that are further away from $x_{\text{val}}$ than the points in $D$. This assumption can easily hold for reasonable data distributions in continuous space.
Theorem 2 indicates that the KNN Shapley value has more predictive power than the KNN LOO value: the KNN Shapley value can predict the relative utility of any two points in $D$, while the KNN LOO value is only able to correctly predict the relative utility of the $K$ nearest neighbors of $x_{\text{val}}$. In Theorem 2, the relative data utility of two points is measured in terms of the model performance difference when using them in combination with a random dataset.
Theorem 2 can be generalized to the setting of multiple validation points using the additivity property. Specifically, for any two training points, the KNN Shapley value with respect to multiple validation points is order-preserving when the order remains the same on each validation point, while the KNN LOO value with respect to multiple validation points is order-preserving when the two points are within the $K$ nearest neighbors of all validation points and the order remains the same on each validation point. We can see that, similar to the single-validation-point setting, the condition for the KNN LOO value with respect to multiple validation points to be order-preserving is more stringent than that for the KNN Shapley value.
Moreover, we would like to highlight that order-preservingness is proposed as a property of data value measures; nevertheless, it can also be regarded as a property of a data value estimator. A data value estimator will be order-preserving if the estimation error is much smaller than the minimum gap between the data values of any two points in the training set. Since the estimation error of a consistent estimator can be made arbitrarily small given enough samples, a consistent estimator (if one exists) for an order-preserving data value measure is also order-preserving when the sample size is large. An example of such an estimator is the sample average of the marginal contribution of a point to the ones preceding it in multiple random permutations.
4.2 Usability for Differentially Private Algorithms
Since the datasets used for machine learning tasks often contain sensitive information (e.g., medical records), it has been increasingly prevalent to develop privacypreserving learning algorithms. Hence, it is also interesting to study how to value data when the learning algorithm preserves some notion of privacy. Differential privacy (DP) has emerged as a strong privacy guarantee for algorithms on aggregate datasets. The idea of DP is to carefully randomize the algorithm so that the output does not depend too much on any individuals’ data.
Definition 2 (Differential privacy).
A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all $R \subseteq \text{range}(\mathcal{A})$ and for all $D, D'$ such that $D$ and $D'$ differ only in one data instance: $P[\mathcal{A}(D) \in R] \leq e^{\epsilon} \cdot P[\mathcal{A}(D') \in R] + \delta$.
By definition, differentially private learning algorithms hide the influence of one training point on the model performance. Thus, intuitively, it will be more difficult to differentiate “good” data from “bad” ones for differentially private models. We will show that the Shapley value could have more discriminative power than the LOO value when the learning algorithm satisfies DP.
The following theorem states that for differentially private learning algorithms, the values of training data become gradually indistinguishable from those of dummy points as the training size grows larger, under both the LOO and the Shapley value measures; nonetheless, the value differences vanish faster for the LOO value than for the Shapley value.
Theorem 3.
For a learning algorithm $\mathcal{A}(\cdot)$ that achieves $(\epsilon(N), \delta(N))$-DP when training on $N$ data points, let the performance measure be $U(S) = -\mathbb{E}[\ell(\mathcal{A}(S), D_{\text{val}})]$ for a loss $\ell$ bounded in $[0, 1]$. Let $\Delta(k) = e^{\epsilon(k)} - 1 + \delta(k)$. Then, it holds that
$\max_i |\nu_{loo}(z_i) - \nu_{loo}(z_{\text{dummy}})| \leq \Delta(N-1), \quad \max_i |\nu_{shap}(z_i) - \nu_{shap}(z_{\text{dummy}})| \leq \frac{1}{N-1} \sum_{k=1}^{N-1} \Delta(k)$  (7)
For typical differentially private learning algorithms, such as adding random noise to stochastic gradient descent, the privacy guarantees will be weaker if we reduce the size of the training set (e.g., see Theorem 1 in (Abadi et al., 2016)). In other words, $\epsilon(\cdot)$ and $\delta(\cdot)$ are monotonically decreasing functions of the training size, and so is $\Delta(\cdot)$. Therefore, it holds that $\Delta(N-1) \leq \frac{1}{N-1} \sum_{k=1}^{N-1} \Delta(k)$. The implications of Theorem 3 are threefold. Firstly, the fact that the maximum difference between all training points’ values and the dummy point’s value is directly upper bounded by $\Delta$ signifies that stronger privacy guarantees will naturally lead to increased difficulty in distinguishing between data values. Secondly, the monotonic dependence of $\Delta$ on the training size indicates that both the LOO and the Shapley value converge to zero when the training size is very large. Thirdly, by comparing the upper bounds for the LOO and the Shapley value, we can see that the convergence rate of the Shapley value is slower, and therefore it could have a better chance of differentiating “good” data from the bad than the LOO value. Note that our results are extendable to general stable learning algorithms, which are insensitive to the removal of an arbitrary point in the training dataset (Bousquet and Elisseeff, 2002). Stable learning algorithms are appealing as they enjoy provable generalization error bounds. Indeed, differentially private algorithms are subsumed by the class of stable algorithms (Wang et al., 2015). A broad variety of other learning algorithms are also stable, including all learning algorithms with Tikhonov regularization. We leave the corresponding theorem for stable algorithms to Appendix C.
5 Experiments
In this section, we first compare the runtime of our algorithm with that of the existing works on various datasets. Then, we compare the usefulness of the data values produced by different algorithms in various applications, including mislabeled data detection, watermark removal, data summarization, active data acquisition, and domain adaptation. We leave the detailed experimental settings, such as model architectures and the hyperparameters of the training processes, to the appendix.
5.1 Baselines
We compare our algorithm, termed KNN-Shapley, with the following baselines.
Truncated Monte Carlo Shapley (TMC-Shapley). This is a Monte Carlo-based approximation algorithm proposed by Ghorbani and Zou (2019). Monte Carlo-based methods regard the Shapley value as the expectation of a training instance’s marginal contribution to a random set and then use the sample mean to approximate it. Evaluating the marginal contribution to a different set requires retraining the model, which bottlenecks the efficiency of Monte Carlo-based methods. TMC-Shapley combines the Monte Carlo method with a heuristic that ignores random sets of large size, since the contribution of a data point to those sets will be small.
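For reference, the core of this permutation-based estimator can be sketched as follows. The truncation threshold and the additive toy utility are our own illustrative choices; in the real algorithm, each utility call retrains the model on the current prefix.

```python
import random

def tmc_shapley(n_points, utility, num_perms=100, tolerance=1e-6, seed=0):
    """Truncated Monte Carlo Shapley: average each point's marginal
    contribution to the points preceding it in random permutations,
    truncating a permutation once the running utility is already close
    to the full-dataset utility."""
    rng = random.Random(seed)
    full = utility(set(range(n_points)))
    values = [0.0] * n_points
    for _ in range(num_perms):
        perm = list(range(n_points))
        rng.shuffle(perm)
        prefix, u_prev = set(), utility(set())
        for p in perm:
            if abs(full - u_prev) < tolerance:   # truncation heuristic
                continue                         # remaining marginals taken as 0
            prefix.add(p)
            u = utility(prefix)
            values[p] += u - u_prev
            u_prev = u
    return [v / num_perms for v in values]

# Toy additive utility: the exact Shapley value of a "good" point is 1.
good = {0, 2}
print(tmc_shapley(4, lambda S: len(S & good)))  # [1.0, 0.0, 1.0, 0.0]
```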
Gradient Shapley (G-Shapley). This is another Monte Carlo-based method proposed by Ghorbani and Zou (2019); it employs a different heuristic to accelerate the algorithm. G-Shapley approximates the model performance change due to the addition of one training point by taking a gradient descent step on that point and calculating the performance difference. This method is applicable only to models trained with gradient methods; hence, it is included as a baseline in our experimental results only when the underlying models are trained using gradient methods.
Leave-One-Out. We use leave-one-out to refer to the algorithm that calculates the exact model performance change due to the removal of a training point. Evaluating the exact leave-one-out error requires retraining the model on the corresponding reduced dataset for every training point, which is also impractical for large models.
KNN-LOO. The leave-one-out value is nevertheless efficient to compute for KNN, as shown in Lemma 1. To use KNN-LOO for valuing data, we first approximate the target model with a KNN and compute the corresponding KNN-LOO value. If the model is a deep net, we compute the value on the deep feature representation; otherwise, we compute the value on the raw data.
Random. The random baseline does not differentiate between different data points’ values and just randomly selects data points from the training set to perform a given task.
5.2 Runtime Comparison
We compare the runtime of the different methods on various datasets and models, and exhibit the results for a logistic regression trained on MNIST and a larger model, ResNet18, trained on CIFAR-10 in Figure 1. We can see that KNN-Shapley outperforms the other baselines by several orders of magnitude for large training data sizes and large model sizes.
5.3 Applications
Most of the following applications are discussed in recent work on data valuation (Ghorbani and Zou, 2019). In this paper, we hope to understand: can a simple, scalable heuristic that approximates the Shapley value with a KNN surrogate outperform these, often more computationally expensive, previous approaches on the same set of applications? As a result, our goal is not to outperform state-of-the-art methods for each application; instead, we hope to put our work in the context of current efforts to understand the relationships between different data valuation techniques and their performance on these tasks.
Detecting Noisy Labels
Labels in the real world are often noisy due to automatic labeling, non-expert labeling, or label corruption by data poisoning adversaries. Even if a human expert can identify incorrectly labeled examples, it is impossible to manually verify all labels for large training datasets. We show that the notion of data value can help human experts prioritize the verification process, allowing them to review only the examples that are most likely to be contaminated. The key idea is to rank the data points according to their data values and prioritize the points with the lowest values. Following (Ghorbani and Zou, 2019)
, we perform experiments in three settings: a Naive Bayes model trained on a spam classification dataset, a logistic regression model trained on Inception-V3 features of a flower classification dataset, and a three-layer convolutional network trained on the Fashion-MNIST dataset. The label flipping ratio is 20%, 10%, and 10%, respectively. The detection performance of different data value measures is illustrated in Figure 2. We can see that the KNN Shapley value outperforms the other baselines in all three settings. Also, the Shapley value-based measures, including ours, TMC-Shapley, and G-Shapley, are more effective than the LOO-based values.
Watermark Removal
Deep neural networks have achieved tremendous success in various fields, but training these models from scratch can be computationally expensive and requires a lot of training data. One party may contribute a dataset and outsource the computation to a different party. How can the dataset contributor claim ownership of the data source of a trained model? A prevalent way of addressing this question is to embed watermarks into the DNNs. There are two classes of watermarking techniques in the existing work: pattern-based techniques and instance-based techniques. Specifically, pattern-based techniques inject into the training set a set of samples blended with the same pattern and labeled with a specific class; the data contributor can later verify the data source of the trained model by checking the output of the model for an input with the pattern. Instance-based techniques, by contrast, inject individual training samples labeled with a specific class as watermarks, and the verification can be done by inputting the same samples into the trained model. Some examples of the watermarks generated by the pattern-based and instance-based techniques are illustrated in Figure 8 in the appendix. In this experiment, we demonstrate that, based on the data values, the model trainer is able to remove the watermarks. The idea is that the watermarks should have low data values by nature, since they contribute little to predicting the normal validation data. Note that this experiment constitutes a new type of attack, which might be of independent interest in itself.
For the pattern-based watermark removal experiment, we consider three settings: two convolutional networks trained on 1000 images from Fashion-MNIST and 10000 images from MNIST, respectively, and a ResNet18 model trained on 1000 images from a face recognition dataset, Pubfig83. The watermark ratio is 10% for all three settings. The details about the watermark patterns are provided in the appendix. Since for the last two settings, either due to large data size or model size, TMC-Shapley, G-Shapley, and Leave-one-out all fail to produce value estimates within 3 hours, we compare our algorithm only with the remaining baselines. The results are shown in Figure 3. We can see that KNN-Shapley achieves performance similar to TMC-Shapley when the time complexity of TMC-Shapley is acceptable, and outperforms all other baselines.
For the instance-based watermark removal experiment, we consider the following settings: a logistic regression model trained on 10000 images from MNIST, a convolutional network trained on 3000 images from CIFAR-10, and a ResNet18 trained on 3000 images from SVHN. The watermark ratio is 10%, 3%, and 3%, respectively. The results of our experiment are shown in Figure 4. For this experiment, we found that both watermarks and benign data tend to have low values on some validation points; therefore, watermarks and benign data are not quite separable in terms of the average value across the validation set. We instead propose to compute the max value across the validation set for each training point, which we call Max-KNN-Shapley, and remove the points with the lowest Max-KNN-Shapley values. The intuition is that out-of-distribution samples are inessential for predicting any of the normal validation points, and thus the maximum of their Shapley values with respect to different validation points should be low. The results show that Max-KNN-Shapley is a more effective measure for detecting instance-based watermarks than all other baselines.
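Given the per-validation-point values, the max-aggregation is a one-line change. The sketch below is our own; the matrix layout (training points by validation points) and the flagging fraction are illustrative assumptions.

```python
import numpy as np

def max_knn_shapley(per_val_values):
    """Aggregate per-validation-point KNN Shapley values (shape: N_train x M_val)
    by max across validation points instead of mean."""
    return per_val_values.max(axis=1)

def flag_watermarks(per_val_values, frac=0.1):
    """Flag the training points with the lowest max-values as watermark candidates."""
    scores = max_knn_shapley(per_val_values)
    n_flag = max(1, int(frac * len(scores)))
    return np.argsort(scores)[:n_flag]
```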
Data Summarization
Data summarization aims to find a small subset that well represents a massive dataset. The utility of the succinct summaries should be comparable to that of the whole dataset. This is a natural application of data values, since we can summarize the dataset by removing low-value points. We consider two settings for this experiment: a single-hidden-layer neural network (Chen et al., 2018b)
trained on the UCI Adult Census dataset, and a ResNet18 trained on Tiny ImageNet. For the first setting, we randomly pick individuals as the training set, half of whom have income exceeding $50,000 per year. We use another balanced dataset to calculate the values of the training data and a held-out set to evaluate the model performance. As Figure 5 (a) shows, our method maintains high performance even after a large portion of the whole training set is removed. The data selected by the Shapley value-based measures are more helpful for boosting model performance than the LOO-based measures. TMC-Shapley and G-Shapley achieve slightly better performance than KNN-Shapley. In the second setting, we use 95000 points as the training set, 5000 points to calculate the values, and 10000 points as the held-out set. The result in Figure 5 (b) shows that KNN-Shapley is able to maintain model performance even after removing 40% of the whole training set. However, TMC-Shapley, G-Shapley, and LOO cannot finish in 24 hours and are hence omitted from the figure.
Active Data Acquisition
Annotated data is often hard and expensive to obtain, particularly in specialized domains where only experts can provide reliable labels. Active data acquisition aims to ease the data collection process by automatically deciding which instances an annotator should label so as to train a model as efficiently and effectively as possible. We assume that we start with a small training set and compute the data values. Then, we train a random forest to predict the value of new data based on their features and select the points with the highest values. For this experiment, we consider two setups. In the first setup, we synthesize a dataset with disparate data quality by adding noise to part of MNIST. In the second, we use Tiny ImageNet, which has realistic variation in data quality. In the first setup, we choose
images from MNIST and add Gaussian white noise to half of them. We use another 100 images to calculate the training data values and a held-out dataset of size 1000 to evaluate the performance. In the second setup, we split the training set into two parts with 5000 training points and 95000 new points. We calculate the values of 2500 data points in the training set based on the other 2500 points. Both Figure
6 (a) and (b) show that new data selected based on the KNN-Shapley value improve model accuracy faster than the other baselines.
Domain Adaptation
Machine learning models are known to have limited capability of generalizing learned knowledge to new datasets or environments. In practice, there is a need to transfer a model from a source domain, where sufficient training data is available, to a target domain, where little labeled data is available. Domain adaptation aims to better leverage the dataset from one domain for prediction tasks in another domain. We show that data values are useful for this task. Specifically, we compute the values of data in the source domain with respect to a held-out set from the target domain. In this way, the data values reflect how useful different training points are for the task in the target domain. We then train the model based only on positive-value points in the source domain and evaluate it in the target domain. We perform experiments on three datasets, namely MNIST, USPS, and SVHN. For the transfer between USPS and MNIST, we use the same experimental setting as (Ghorbani and Zou, 2019). We first train a multinomial logistic regression classifier. We randomly sample 1000 images from the source domain as the training set, calculate the values of the training data based on 1000 instances from the target domain, and evaluate the performance of the model on another 1000 target-domain points. The results are summarized in Table 1, which shows that KNN-Shapley performs better than TMC-Shapley. For the transfer between SVHN and MNIST, we pick training data from SVHN, train a ResNet-18 model (He et al., 2016), and evaluate the performance on the whole test set of MNIST. KNN-Shapley runs efficiently on data of this scale, while the TMC-Shapley algorithm cannot finish in 48 hours.
Table 1: Test accuracy before → after retraining on the points selected by each method ("n/a": the method did not finish).

Method       | MNIST → USPS    | USPS → MNIST    | SVHN → MNIST
KNN-Shapley  | 31.7% → 48.40%  | 23.35% → 30.25% | 9.65% → 20.25%
TMC-Shapley  | 31.7% → 44.90%  | 23.35% → 29.55% | n/a
KNN-LOO      | 31.7% → 39.40%  | 23.35% → 24.52% | 9.65% → 11.70%
LOO          | 31.7% → 29.40%  | 23.35% → 23.53% | n/a
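The selection step used in these domain adaptation experiments amounts to filtering source points by the sign of their value and retraining. A minimal sketch (the names and the pluggable `value_fn` are ours; `value_fn` stands in for any data-value routine, such as a KNN-Shapley estimator):

```python
import numpy as np

def domain_adapt_select(X_src, y_src, X_tgt_val, y_tgt_val, value_fn):
    """Value each source-domain training point with respect to a held-out
    set from the target domain, then keep only the positive-value points;
    the model is subsequently retrained on the returned subset."""
    values = value_fn(X_src, y_src, X_tgt_val, y_tgt_val)
    keep = values > 0
    return X_src[keep], y_src[keep], values
```

Because the values are computed against target-domain validation data, the retained subset is biased toward source points that are informative for the target task.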
6 Conclusion
In this paper, we propose an efficient heuristic to approximate the Shapley value based on KNN proxies, which, for the first time, enables data valuation for large-scale datasets and large models. We demonstrate the utility of the approximate Shapley values produced by our algorithm on a variety of applications, from noisy label detection, watermark removal, data summarization, and active data acquisition to domain adaptation. We also compare with existing Shapley value approximation algorithms and show that our values achieve comparable performance while the computation is much more efficient and scalable. We characterize the advantage of the Shapley value over the LOO value from a theoretical perspective and show that it is preferable in terms of predictive power and discriminative power under differentially private learning algorithms.
References
M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631, 2018.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2 (Mar), pp. 499–526, 2002.
B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.
Q. Chen, C. Xiang, M. Xue, B. Li, N. Borisov, D. Kaafar, and H. Zhu. Differentially private data generative models. arXiv preprint arXiv:1812.02274, 2018.
A. Ghorbani and J. Zou. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019.
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gurel, B. Li, C. Zhang, C. Spanos, and D. Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12 (11), pp. 1610–1623, 2019.
R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gurel, B. Li, C. Zhang, D. Song, and C. Spanos. Towards efficient data valuation based on the Shapley value. arXiv preprint arXiv:1902.10275, 2019.
P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1885–1894, 2017.
Y.-X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. arXiv preprint arXiv:1502.06309, 2015.
Appendix A Proof of Theorem 2
(Restatement of Theorem 2.)
Proof.
The proof relies on dissecting the terms appearing in the definition of the order-preserving property.
Consider any two points . We start by analyzing . Let the th nearest neighbor of in be denoted by . Moreover, we will use to indicate that is closer to the validation point than , i.e., . We first analyze the expectation of the above utility difference by considering the following cases:
(1) . In this case, adding or into will not change the nearest neighbors to and therefore . Hence, .
(2) . In this case, including the point i into T can expel the Kth nearest neighbor from the original set of K nearest neighbors while including the point will not change the nearest neighbors. In other words, and . Hence, .
(3) . In this case, including the point or will both change the original nearest neighbors in by excluding the th nearest neighbor. Thus, and . It follows that .
Combining the three cases discussed above, we have
(8)  
(9)  
(10) 
Note that removing the first term in (A) cannot change the sign of the sum in (A). Hence, when analyzing the sign of (A), we only need to focus on the second term:
(11) 
Since , the sign of (11) will be determined by the sign of . Hence, we get
(12) 
Now, we switch to the analysis of the value difference. By Theorem 1, it holds for the KNN-Shapley value that
(13)  
(14)  
(15) 
Note that for all . Thus, if and , the minimum of (A) is achieved when for all and the minimum value is , which is greater than zero. On the other hand, if and , then the maximum of (A) is achieved when for all and the maximum value is , which is less than zero.
Summarizing the above analysis, we get that has the same sign as . By (12), it follows that also shares the same sign as .
To analyze the sign of the KNN-LOO value difference, we first write out its expression:
(16) 
Therefore, has the same sign as only when .
∎
Appendix B Proof of Theorem 3
We will need the following lemmas on group differential privacy for the proof of Theorem 3.
Lemma 2.
If is differentially private with respect to one change in the database, then is differentially private with respect to changes in the database.
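The parameters of the lemma were lost in extraction; for completeness, the standard group-privacy statement it corresponds to is a well-known fact and reads:

```latex
% Group differential privacy: a mechanism M that is (\epsilon, \delta)-DP
% for databases differing in one record satisfies, for databases
% differing in k records,
\text{If } \mathcal{M} \text{ is } (\epsilon,\delta)\text{-DP, then }
\mathcal{M} \text{ is } \bigl(k\epsilon,\; k e^{(k-1)\epsilon}\delta\bigr)\text{-DP
with respect to } k \text{ changes,}
```

which reduces to \(k\epsilon\)-DP in the pure case \(\delta = 0\).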
Lemma 3 (Jia et al. [2019b]).
For any , the difference in Shapley values between and is
(17) 
See 3
Proof.
Let be the set with one element in replaced by a different value. Let the probability density/mass defined by and be and , respectively. Using Lemma 2, for any we have
(18)  
(19)  
(20) 
It follows that
(21)  
(22) 
By symmetry, it also holds that
(23)  
(24) 
Thus, we have the following bound:
(25) 
Denote . For the performance measure that evaluates the loss averaged across multiple validation points , we have
(26) 
Making the dependence on the training set size explicit, we can rewrite the above equation as
(27) 
By Lemma 3, we have for all ,
(28)  
(29)  
(30) 
As for the LOO value, we have
(31)  
(32) 
∎
Appendix C Comparing the LOO and the Shapley Value for Stable Learning Algorithms
An algorithm has uniform stability with respect to the loss function if for all , where denotes the training set and denotes the set obtained by removing the th element of .
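With the extraction-elided symbols written out in the standard notation of Bousquet and Elisseeff (2002), the uniform-stability condition can be stated as follows (our reconstruction; \(A_S\) denotes the model trained on \(S\)):

```latex
% beta-uniform stability of learning algorithm A w.r.t. loss \ell:
% A_S is the model trained on S; S^{\setminus i} is S with its i-th
% element removed.
\sup_{z}\,\bigl|\ell(A_{S}, z) - \ell(A_{S^{\setminus i}}, z)\bigr|
  \;\le\; \beta
\qquad \text{for all } S \text{ and all } i \in \{1,\dots,|S|\}.
```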
Theorem 4.
Consider a learning algorithm with uniform stability , where is the size of the training set and is some constant, and let the performance measure be . Then,
(33) 
and
(34) 
Proof.
By the definition of uniform stability, it holds that
(35) 
Using Lemma 3, we have, for all ,
(36)  
(37)  
(38) 
Recall the bound on the harmonic sequences
which gives us
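The harmonic-sequence bound invoked here, whose statement was lost in extraction, is the standard estimate:

```latex
\ln(N+1) \;\le\; \sum_{i=1}^{N} \frac{1}{i} \;\le\; 1 + \ln N .
```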
As for the LOO value, we have
(39) 
∎
Appendix D Additional Experiments
D.1 Removing Data Points of High Value
We remove training points from most valuable to least valuable and evaluate the accuracy of the model trained on the remaining data. The experimental setting, including datasets and models, is the same as that of the data summarization experiment. Figure 7 compares the power of different data value measures to detect high-utility data. We can see that removing high-value points based on KNN-Shapley, G-Shapley, and TMC-Shapley can all significantly reduce the model performance, indicating that these three heuristics are effective at detecting the most valuable training points. On UCI Census, TMC-Shapley achieves the best performance, and KNN-Shapley performs similarly to G-Shapley. On Tiny ImageNet, both TMC-Shapley and G-Shapley cannot finish in 24 hours and are therefore omitted from the comparison. Compared with the random baseline, KNN-Shapley leads to a much faster performance drop when removing high-value points.
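These removal experiments (and the data-summarization ones, which share the setting) boil down to tracing a removal curve. A sketch of that evaluation loop, with the hypothetical `fit_eval` standing in for any train-and-score routine:

```python
import numpy as np

def removal_curve(values, X, y, X_test, y_test, fit_eval,
                  fractions=(0.0, 0.1, 0.2, 0.3), high_first=True):
    """Test accuracy after removing a growing fraction of training points,
    ordered by data value. Highest-value first (as in this experiment);
    lowest-value first reproduces the summarization experiments."""
    order = np.argsort(values)
    if high_first:
        order = order[::-1]          # most valuable points first
    accs = []
    for f in fractions:
        drop = int(f * len(values))
        keep = order[drop:]          # indices of the surviving points
        accs.append(fit_eval(X[keep], y[keep], X_test, y_test))
    return accs
```

An effective value measure makes the high-first curve fall quickly and the low-first curve stay flat.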
D.2 Rank Correlation with the Ground Truth Shapley Value
We perform experiments to compare the ground truth Shapley value of raw data with the value estimates produced by different heuristics. The ground truth Shapley value is computed using the group testing algorithm in [Jia et al., 2019b], which approximates the Shapley value with provable error bounds. We use a fully-connected neural network with three hidden layers as the target model. Following the setting in [Jia et al., 2019b], we construct a size-1000 training set using MNIST, which contains both benign and adversarial examples, as well as a size-100 validation set of pure adversarial examples. The adversarial examples are generated by the Fast Gradient Sign Method [Goodfellow et al., 2014]. This construction is meant to simulate data with different levels of usefulness: the adversarial examples in the training set should be more valuable than the benign data because they improve the prediction on adversarial examples. Note that KNN-Shapley computes the Shapley value of the deep features extracted from the penultimate layer.
The rank correlations of KNN-Shapley and G-Shapley with the ground truth Shapley value are 0.08 and 0.024, with p-values 0.0046 and 0.4466, respectively. This shows that both heuristics may not preserve the exact rank of the ground truth Shapley value. Since TMC-Shapley cannot finish in a week for this model and data size, we omit it from the comparison. We further apply some local smoothing to the values and check whether these heuristics produce large values for data groups with large Shapley values. Specifically, we compute the 1st to 100th percentiles of the Shapley values, find the group of data points within each percentile interval, and compute the average Shapley value as well as the average heuristic values for each group. The rank correlations of the average KNN-Shapley and the average G-Shapley with the average ground truth Shapley value over these data groups are 0.22 and 0.002, with p-values 0.0293 and 0.9843, respectively. We can see that, although it ignores the data contribution to feature learning, KNN-Shapley preserves the rank of the Shapley value at a macroscopic level better than G-Shapley.
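The smoothing-then-correlating procedure can be sketched as follows. This is our illustration, not the paper's code: a plain-NumPy Spearman correlation (assuming no ties) and percentile grouping over the ground-truth values.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation, assuming no ties: Pearson on ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

def percentile_smoothed(truth, estimate, n_groups=100):
    """Group points by percentile interval of the ground-truth value and
    average both the true and estimated values within each group (the
    'local smoothing' described above)."""
    edges = np.percentile(truth, np.linspace(0, 100, n_groups + 1))
    t_avg, e_avg = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (truth >= lo) & (truth <= hi)
        if mask.any():
            t_avg.append(truth[mask].mean())
            e_avg.append(estimate[mask].mean())
    return np.array(t_avg), np.array(e_avg)
```

The macroscopic comparison then reduces to `spearman(*percentile_smoothed(truth, estimate))`.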
Appendix E Experiment Details
E.1 Watermark Removal
For pattern-based watermarks, we adopted two types of patterns: one changes the pixel values at the corner of an image [Chen et al., 2018a] (Figure (a)), and the other blends a specific word (like "TEST") into an image (Figure (b)). The first pattern is used in the experiments on Fashion-MNIST and MNIST, which contain single-channel images. The second pattern is used in the experiment on Pubfig83, which contains multi-channel images.
For instance-based watermarks, we used the same watermarks as [Adi et al., 2018]: a set of abstract images with specifically assigned labels. An example of a trigger image is shown in Figure (c).