1 Background
A commonly used method to evaluate data importance/value to a model is leave-one-out (LOO), which compares the accuracy of the model trained on the entire dataset against the accuracy of the model trained on the entire dataset minus one data tuple [4]. However, LOO does not satisfy all the ideal properties that we expect for data valuation. For example, in a support vector machine (SVM), given a data tuple z in a dataset, if there is an exact copy z′ of z in the dataset, removing z from this dataset does not change the predictor at all, since z′ is still there. Therefore, LOO will assign zero (or a very low) value to z regardless of how important z is.

The Shapley value is a concept in cooperative game theory, named in honor of Lloyd Shapley [9]. The Shapley value is the only value division scheme for compensation allocation that meets three desirable criteria: group rationality, fairness, and additivity [8]. Combined with its flexibility to support different utility functions, the Shapley value has been extensively employed in the field of data markets [1, 2, 7, 8]. One major challenge of applying the Shapley value is its prohibitively high computational complexity. Evaluating the exact Shapley value involves computing the marginal contribution of each data tuple to every coalition, which is #P-complete [6]. Such exponential computation is clearly impractical for evaluating a large number of training data tuples. Even worse, for machine learning tasks, evaluating the utility function is extremely expensive, as each evaluation requires training a model. In the worst case, we need to train 2^n models, where n is the number of data tuples, to compute the exact Shapley value of each data tuple.
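The duplicate-copy failure of LOO can be illustrated on a toy utility. The utility below is a hypothetical stand-in for model accuracy (not the SVM utility itself): tuples "a" and "b" are exact copies, so removing either one alone changes nothing.

```python
def utility(subset):
    """Hypothetical utility U(S): a stand-in for the test accuracy of a model
    trained on S. Tuples "a" and "b" are exact copies, so either one alone
    already contributes their full value."""
    s = set(subset)
    score = 0
    if "a" in s or "b" in s:  # redundant pair counts only once
        score += 50
    if "c" in s:
        score += 30
    return score

def loo_value(z, data):
    """Leave-one-out value of z: U(D) - U(D \\ {z})."""
    rest = [t for t in data if t != z]
    return utility(data) - utility(rest)

data = ["a", "b", "c"]
print(loo_value("a", data))  # 0  -> the copy "b" masks "a"'s importance
print(loo_value("c", data))  # 30
```

Even though "a" may be highly informative, LOO values it at zero because its copy "b" remains in the dataset.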
2 Approximate Shapley Value
Shapley value based compensation is a prevalently adopted approach, mostly due to its theoretical properties, especially fairness. The Shapley value measures the marginal improvement in model utility contributed by a data tuple z, averaged over all possible coalitions of the data tuples. The formal definition of the Shapley value of z is as follows.
SV(z) = \sum_{S \subseteq D \setminus \{z\}} \frac{|S|!\,(|D|-|S|-1)!}{|D|!} \left( U(S \cup \{z\}) - U(S) \right)    (1)
where D is the training dataset and U(·) is the utility of the model trained on a (sub)set of the data tuples, measured on the testing dataset.
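Equation (1) can be sketched by direct enumeration over all coalitions. The utility below is a hypothetical stand-in for test accuracy (with "a" and "b" as exact copies), used only to make the coalition weights concrete; exact rational arithmetic avoids float rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def utility(subset):
    """Hypothetical utility U(S); "a" and "b" are exact copies of each other."""
    s = set(subset)
    return 50 * int("a" in s or "b" in s) + 30 * int("c" in s)

def exact_shapley(z, data):
    """Equation (1): sum over all coalitions S of D \\ {z}, each marginal
    contribution weighted by |S|! (|D|-|S|-1)! / |D|!."""
    others = [t for t in data if t != z]
    n = len(data)
    value = Fraction(0)
    for k in range(len(others) + 1):
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for coalition in combinations(others, k):
            value += weight * (utility(coalition + (z,)) - utility(coalition))
    return value

data = ["a", "b", "c"]
print([float(exact_shapley(z, data)) for z in data])  # [25.0, 25.0, 30.0]
```

Unlike LOO, the Shapley value splits the joint contribution of the duplicated pair evenly between the two copies, and the values sum to U(D) = 80 (group rationality).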
Monte Carlo Simulation Method. Since the exact Shapley value computation is based on enumeration, which is prohibitively expensive, we adopt a commonly used Monte Carlo simulation method [3, 6] to compute the approximate Shapley value. We first sample M random permutations of the data tuples, and then scan each permutation from the first data tuple to the last, calculating the marginal contribution of every newly added data tuple. Repeating this procedure over multiple Monte Carlo permutations, the final estimate of the Shapley value is simply the average of all the calculated marginal contributions. This Monte Carlo sampling gives an unbiased estimate of the Shapley value. In practice, we generate Monte Carlo estimates until the average has empirically converged, and our experiments show that the estimates converge very quickly. The Monte Carlo simulation method can therefore control the degree of approximation: the more permutations, the better the accuracy. The detailed algorithm is shown in Algorithm 1, where M is the number of permutations; the larger M is, the more accurate the computed Shapley value.

3 Shapley Value Definitions
The existing work [2, 7] takes the original value when computing the marginal contribution of each data tuple to each coalition. The formal definition, named Original Shapley Value, is shown in Equation (1). A toy experiment result is shown in Figure 1. For ease of visualization, we take the first two features (sepal length and sepal width) and the first two species (Iris Setosa and Iris Versicolour) from the classic Iris dataset [5]. We randomly choose 20 data tuples as the testing dataset and use the remaining tuples as the training dataset. Iris Setosa (Versicolour) is shown in red (blue). The model utility is measured by an SVM. The data tuples that are support vectors are shown as circles, and the top 10 data tuples with the highest (lowest) Shapley value are shown as squares (pentagrams), denoted highest10 (lowest10).
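The toy setup above can be sketched as follows. The kernel and the random split are assumptions (the paper does not state them); scikit-learn's SVC with its default RBF kernel is used here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
mask = y < 2                         # keep Iris Setosa and Iris Versicolour only
X2, y2 = X[mask][:, :2], y[mask]     # first two features: sepal length/width

rng = np.random.default_rng(0)       # hypothetical split; the paper's is random
test_idx = rng.choice(len(X2), size=20, replace=False)
is_test = np.zeros(len(X2), dtype=bool)
is_test[test_idx] = True

svm = SVC().fit(X2[~is_test], y2[~is_test])
# svm.support_ holds the training-set indices of the support vectors,
# i.e. the circled tuples in Figures 1-3.
print(len(svm.support_), "support vectors")
```

The support-vector indices are what the Observation below compares against highest10 and lowest10.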
Alternatively, if the marginal contribution of a data tuple to a coalition is negative, we may take zero rather than the original value. The formal definition, named Zero Shapley Value, is shown in Equation (2), and the toy experiment result is shown in Figure 2.
SV_{zero}(z) = \sum_{S \subseteq D \setminus \{z\}} \frac{|S|!\,(|D|-|S|-1)!}{|D|!} \max\left( U(S \cup \{z\}) - U(S),\, 0 \right)    (2)
We observe that whether the effect of adding a new data tuple is positive or negative, as long as the effect is large enough, the newly added data tuple is significant to the trained model. Therefore, we propose a new definition named Absolute Shapley Value. In the absolute Shapley value, if the marginal contribution of a data tuple to a coalition is negative, we take its absolute value. The formal definition is shown in Equation (3), and the toy experiment result is shown in Figure 3.
SV_{abs}(z) = \sum_{S \subseteq D \setminus \{z\}} \frac{|S|!\,(|D|-|S|-1)!}{|D|!} \left| U(S \cup \{z\}) - U(S) \right|    (3)
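The three definitions differ only in how the marginal contribution U(S ∪ {z}) − U(S) enters the average. A minimal Monte Carlo sketch of all three variants, using a hypothetical utility in which tuple "d" always hurts the model, so its marginal contribution is always negative and the three definitions visibly disagree:

```python
import random

def utility(subset):
    """Hypothetical utility: "a"/"b" are copies, "c" helps, "d" always hurts."""
    s = set(subset)
    return (50 * int("a" in s or "b" in s)
            + 30 * int("c" in s)
            - 20 * int("d" in s))

def monte_carlo_shapley(data, utility, num_permutations=2000, seed=0):
    """Permutation-sampling estimator; accumulates each marginal contribution
    as-is (ORI, Eq. 1), clipped at zero (ZERO, Eq. 2), and in absolute value
    (ABS, Eq. 3)."""
    rng = random.Random(seed)
    values = {z: {"ORI": 0.0, "ZERO": 0.0, "ABS": 0.0} for z in data}
    for _ in range(num_permutations):
        perm = list(data)
        rng.shuffle(perm)
        prefix, prev_u = [], utility([])
        for z in perm:
            prefix.append(z)
            u = utility(prefix)
            marginal = u - prev_u
            values[z]["ORI"] += marginal
            values[z]["ZERO"] += max(marginal, 0.0)
            values[z]["ABS"] += abs(marginal)
            prev_u = u
    return {z: {k: v / num_permutations for k, v in d.items()}
            for z, d in values.items()}

vals = monte_carlo_shapley(["a", "b", "c", "d"], utility)
print(vals["d"])  # ORI: -20.0, ZERO: 0.0, ABS: 20.0
```

Since adding "d" always changes the utility by exactly −20, its ORI, ZERO, and ABS values are −20, 0, and 20 respectively: ORI marks it as harmful, ZERO ignores it, and ABS marks it as influential.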
Observation.
The Shapley value is employed to evaluate data importance to a model; the more important a data tuple is, the higher its Shapley value should be. For an SVM classifier, generally speaking, the data tuples that serve as support vectors should have higher Shapley values. Under the original Shapley value (Figure 1), highest10 contains only 4 data tuples that are support vectors; even worse, lowest10 also contains 3 support vectors. In contrast, under the zero Shapley value (Figure 2), highest10 contains 5 support vectors. The absolute Shapley value performs best: its highest10 contains 6 support vectors. Furthermore, lowest10 under the absolute Shapley value is more compact and more centrally located than lowest10 under the zero Shapley value, i.e., the data tuples in lowest10 of the absolute Shapley value are even less important than those in lowest10 of the zero Shapley value.

4 Experiment
We ran experiments on a machine with an Intel Core i7-8700K CPU, two NVIDIA GeForce GTX 1080 Ti GPUs, and 64 GB of memory, running Ubuntu. We computed the Shapley value of each data tuple under each of the definitions above, implemented in Python 3.6.
4.1 Performance on Iris dataset
The Iris flower dataset [5] is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), with four features measured in centimeters: sepal length, sepal width, petal length, and petal width.
We employ two classic machine learning models, Logistic Regression (LR) and Support Vector Machine (SVM), to evaluate the effectiveness of the different Shapley value definitions. We first compute the Shapley value of each tuple in the training dataset using the Monte Carlo method. We then train two predictive models from scratch: one on the top K training data tuples with the highest Shapley value, and one on the top K training data tuples with the lowest Shapley value.

The model accuracy is reported in Table 1. Surprisingly, for both LR and SVM, the accuracy of the model trained on the top K training data tuples with the highest ORI equals the accuracy of the model trained on the top K tuples with the lowest ORI. However, this phenomenon exactly validates our conjecture that "whether the effect of adding a new data tuple is positive or negative, as long as the effect is large enough, the newly added data tuple is significant to the trained model". Recall from Figure 1 that not only data tuples in highest10 but also data tuples in lowest10 can be support vectors. Furthermore, for both LR and SVM, the model trained on the top K tuples with the lowest ABS has the lowest accuracy, which verifies that the data tuples with the lowest ABS are truly unimportant. Recall from Figure 3 that the lowest10 data tuples lie in the middle of the dataset and matter little to model accuracy. Therefore, ABS outperforms ORI and ZERO in terms of evaluating data importance.
Table 1: Model accuracy.

       LR (highest K)   LR (lowest K)   SVM (highest K)   SVM (lowest K)
ORI    100.00%          100.00%         93.33%            93.33%
ZERO   100.00%          63.33%          96.66%            90.00%
ABS    100.00%          60.00%          96.66%            90.00%
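The retraining protocol of this section can be sketched as follows. The Shapley scores below are random placeholders (in the actual experiment they come from the Monte Carlo estimator of Algorithm 1), and the choice K = 60, the split, and the solver settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, random_state=0)

# Placeholder per-tuple Shapley scores; the real ones come from Algorithm 1.
rng = np.random.default_rng(0)
shapley = rng.random(len(X_train))

K = 60  # illustrative choice of K
order = np.argsort(shapley)
for name, idx in [("lowest-K", order[:K]), ("highest-K", order[-K:])]:
    # retrain from scratch on only the selected K tuples
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    print(f"{name}: accuracy = {model.score(X_test, y_test):.2%}")
```

With real Shapley scores substituted, this loop produces one row of Table 1 per definition.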
5 Conclusion and Future Work
In this paper, for the first time, we define the absolute Shapley value for evaluating data importance in training machine learning models. The experimental results of LR and SVM on the Iris dataset show that the absolute Shapley value definition outperforms the original Shapley value and the zero Shapley value in terms of evaluating data importance. For future work, we would like to explore the effectiveness of the different Shapley value definitions on more machine learning models and more representative datasets.
References
 [1] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726. ACM, 2019.

[2] Marco Ancona, Cengiz Öztireli, and Markus H. Gross. Explaining deep neural networks with a polynomial time algorithm for Shapley value approximation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 272–281, 2019.
 [3] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the Shapley value based on sampling. Computers & OR, 36(5):1726–1730, 2009.
 [4] Gavin C. Cawley and Nicola L. C. Talbot. Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11):2585–2592, 2003.
 [5] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 [6] S. Shaheen Fatima, Michael J. Wooldridge, and Nicholas R. Jennings. A linear approximation method for the Shapley value. Artif. Intell., 172(14):1673–1699, 2008.
 [7] Amirata Ghorbani and James Y. Zou. Data Shapley: Equitable valuation of data for machine learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 2242–2251, 2019.
 [8] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019.
 [9] Lloyd S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.