
Absolute Shapley Value

03/23/2020
by   Jinfei Liu, et al.
Emory University

Shapley value is a concept in cooperative game theory for measuring the contribution of each participant, named in honor of Lloyd Shapley. Shapley value has recently been applied in data marketplaces for compensation allocation based on each data owner's contribution to the trained models. Shapley value is the only value division scheme used for compensation allocation that meets three desirable criteria: group rationality, fairness, and additivity. In cooperative game theory, the marginal contribution of each contributor to each coalition is a nonnegative value. However, in machine learning model training, the marginal contribution of each contributor (data tuple) to each coalition (a set of data tuples) can be negative, i.e., the accuracy of the model trained on a dataset with an additional data tuple can be lower than the accuracy of the model trained on the dataset alone. In this paper, we investigate how to handle negative marginal contributions when computing Shapley value. We explore three philosophies: 1) taking the original value (Original Shapley Value); 2) taking the larger of the original value and zero (Zero Shapley Value); and 3) taking the absolute value of the original value (Absolute Shapley Value). Experiments on the Iris dataset demonstrate that the Absolute Shapley Value definition significantly outperforms the other two definitions in terms of evaluating data importance (the contribution of each data tuple to the trained model).



1 Background

A commonly used method to evaluate the importance/value of data to a model is leave-one-out (LOO), which compares the accuracy of the model trained on the entire dataset against the accuracy of the model trained on the entire dataset minus one data tuple [4]. However, LOO does not satisfy all the properties that we expect of a data valuation. For example, in support vector machine (SVM), given a data tuple z in a dataset, if there is an exact copy of z in the dataset, removing z from the dataset does not change the predictor at all, since the copy is still there. Therefore, LOO assigns zero (or a very low) value to z regardless of how important z actually is.

Shapley value is a concept in cooperative game theory, named in honor of Lloyd Shapley [9]. Shapley value is the only value division scheme used for compensation allocation that meets three desirable criteria: group rationality, fairness, and additivity [8]. Combined with its flexibility to support different utility functions, Shapley value has been extensively employed in the field of data markets [1, 2, 7, 8]. One major challenge of applying Shapley value is its prohibitively high computational complexity. Evaluating the exact Shapley value involves computing the marginal contribution of each data tuple to every coalition, which is #P-complete [6]. Such exponential computation is clearly impractical for evaluating a large number of training data tuples. Even worse, for machine learning tasks, evaluating the utility function is extremely expensive because each evaluation requires training a model. In the worst case, we need to train up to 2^(n-1) models to compute the exact Shapley value for each of the n data tuples.

A number of approximation methods have been developed to overcome the computational hardness of finding the exact Shapley value. The most representative method is Monte Carlo method [3, 6], which is based on the random sampling of permutations.

2 Approximate Shapley Value

Shapley value based compensation is a prevalently adopted approach, mostly due to its theoretical properties, especially fairness. The Shapley value measures the marginal improvement of model utility contributed by a data tuple z_i, averaged over all possible coalitions of the other data tuples. Given a training dataset D = {z_1, ..., z_n} and a utility function U, the formal definition of the Shapley value SV_i of z_i is:

SV_i = \sum_{S \subseteq D \setminus \{z_i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left( U(S \cup \{z_i\}) - U(S) \right)    (1)

where U(S) is the utility of the model trained on a (sub)set S of the data tuples, and the model utility is measured on the testing dataset.
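As a minimal sketch of the exact computation in Equation (1): for each data tuple, enumerate every coalition of the remaining tuples and accumulate the weighted marginal contribution. The utility function here is a toy stand-in (sum of values, capped at 1), not a trained model; in the paper's setting, U would be test accuracy.

```python
from itertools import combinations
from math import factorial

def exact_shapley(data, utility):
    """Exact Shapley value of each tuple, per Equation (1).

    Enumerates all 2^(n-1) coalitions for each of the n tuples,
    so this is only feasible for very small n.
    """
    n = len(data)
    sv = [0.0] * n
    for i, z in enumerate(data):
        rest = data[:i] + data[i + 1:]
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(rest, size):
                # marginal contribution of z to coalition s, weighted
                sv[i] += weight * (utility(s + (z,)) - utility(s))
    return sv

# Toy utility (illustrative only, not a trained model).
toy_utility = lambda coalition: min(1.0, sum(coalition))
sv = exact_shapley([0.2, 0.3, 0.9], toy_utility)
```

By the efficiency property, the values sum to U(D) - U(∅), which is 1.0 for this toy utility.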

Monte Carlo Simulation Method. Since the exact Shapley value computation is based on enumeration, which is prohibitively expensive, we adopt a commonly used Monte Carlo simulation method [3, 6] to compute the approximate Shapley value. We first sample m random permutations of the data tuples; we then scan each permutation from the first data tuple to the last and calculate the marginal contribution of every newly added data tuple. Repeating this procedure over multiple Monte Carlo permutations, the final estimate of the Shapley value is simply the average of all the calculated marginal contributions. This Monte Carlo sampling gives an unbiased estimate of the Shapley value. In practice, we generate Monte Carlo estimates until the average has empirically converged, and our experiments show that the estimates converge very quickly. The Monte Carlo simulation method can therefore control the degree of approximation: the more permutations, the better the accuracy. The detailed procedure is shown in Algorithm 1, where m is the number of permutations. The larger m is, the more accurate the computed Shapley value.

input : training dataset D = {z_1, ..., z_n} and number of permutations m.
output : Shapley value SV_i for each data tuple z_i in D.
1 initialize SV_i ← 0 for all i;
2 for k = 1 to m do
3       sample a random permutation π_k of D;
4       for i = 1 to n do
5             v ← U(π_k[1..i]) − U(π_k[1..i−1]);
6             SV_{π_k[i]} ← SV_{π_k[i]} + v / m;
7       end
8 end
Algorithm 1 Monte Carlo Shapley value computation.
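Algorithm 1 can be sketched in Python as follows. The utility function is assumed to be given (in the paper's setting, the test accuracy of a model trained on the coalition); the toy utility below is illustrative only.

```python
import random

def monte_carlo_shapley(data, utility, m):
    """Algorithm 1: approximate Shapley values from m random permutations."""
    n = len(data)
    sv = [0.0] * n
    for _ in range(m):
        perm = list(range(n))
        random.shuffle(perm)  # one random ordering of the data tuples
        prev = utility(())    # utility of the empty coalition
        coalition = []
        for idx in perm:
            coalition.append(data[idx])
            cur = utility(tuple(coalition))
            # marginal contribution of data[idx], averaged over permutations
            sv[idx] += (cur - prev) / m
            prev = cur
    return sv

# Toy utility (illustrative only, not a trained model).
toy_utility = lambda coalition: min(1.0, sum(coalition))
random.seed(0)
approx = monte_carlo_shapley([0.2, 0.3, 0.9], toy_utility, m=2000)
```

Because the marginal contributions along each permutation telescope to U(D) − U(∅), the estimates always sum to the total utility, and each per-tuple estimate converges to its exact Shapley value as m grows.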

3 Shapley Value Definitions

The existing work [2, 7] takes the original value when computing the marginal contribution of each data tuple to each coalition. The formal definition, named Original Shapley Value, is shown in Equation (1). A toy experiment result is shown in Figure 1. For ease of visualization, we take the first two features (sepal length and sepal width) and the first two species (Iris Setosa and Iris Versicolour) from the classic Iris dataset [5]. We randomly choose 20 data tuples as the testing dataset and use the remainder as the training dataset. Iris Setosa (Versicolour) is shown in red (blue). The model utility is measured by SVM. The data tuples that are support vectors are shown as circles, and the top 10 data tuples with the highest (lowest) Shapley value are shown as squares (pentagrams), denoted highest10 (lowest10).

Figure 1: Original Shapley Value

Alternatively, if the marginal contribution of a data tuple to a coalition is negative, we may take zero rather than the original value. The formal definition, named Zero Shapley Value, is shown in Equation (2), and the toy experiment result is shown in Figure 2.

SV_i = \sum_{S \subseteq D \setminus \{z_i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \max\left( U(S \cup \{z_i\}) - U(S), \, 0 \right)    (2)

We observe that whether the effect of adding a new data tuple is positive or negative, as long as the effect is large enough, the newly added data tuple is significant to the trained model. Therefore, we propose a new definition, named Absolute Shapley Value: if the marginal contribution of a data tuple to a coalition is negative, we take its absolute value. The formal definition is shown in Equation (3), and the toy experiment result is shown in Figure 3.

SV_i = \sum_{S \subseteq D \setminus \{z_i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left| U(S \cup \{z_i\}) - U(S) \right|    (3)
Figure 2: Zero Shapley Value.
Figure 3: Absolute Shapley Value.
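The three definitions differ only in how each raw marginal contribution U(S ∪ {z_i}) − U(S) is transformed before being averaged. A minimal sketch (the function name and mode strings are our own labels):

```python
def transform_marginal(delta, mode):
    """Map a raw marginal contribution to the value actually accumulated.

    mode = "ORI":  original value            (Equation 1)
    mode = "ZERO": negatives clipped to zero (Equation 2)
    mode = "ABS":  absolute value            (Equation 3)
    """
    if mode == "ORI":
        return delta
    if mode == "ZERO":
        return max(delta, 0.0)
    if mode == "ABS":
        return abs(delta)
    raise ValueError("unknown mode: " + mode)
```

For example, a marginal contribution of −0.05 counts as −0.05 under ORI, 0 under ZERO, and 0.05 under ABS; a positive contribution is unchanged under all three.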

Observation.

Shapley value is employed to evaluate data importance to a model: the more important the data tuple, the higher its Shapley value should be. In an SVM classifier, generally speaking, the data tuples that are support vectors should have higher Shapley values. Under original Shapley value (Figure 1), highest10 contains only 4 support vectors; even worse, lowest10 also contains 3 support vectors. In contrast, under zero Shapley value (Figure 2), highest10 contains 5 support vectors. Absolute Shapley value performs best: its highest10 contains 6 support vectors. Furthermore, lowest10 under absolute Shapley value is more compact and lies closer to the middle of the dataset than lowest10 under zero Shapley value, i.e., the data tuples in lowest10 of absolute Shapley value are indeed less important than those in lowest10 of zero Shapley value.

4 Experiment

We ran experiments on a machine with an Intel Core i7-8700K CPU, two NVIDIA GeForce GTX 1080 Ti GPUs, and 64GB of memory, running Ubuntu. We compute the Shapley value of each data tuple in Python 3.6 based on the following definitions.

  • ORI: Original Shapley value definition in Equation (1).

  • ZERO: Zero Shapley value definition in Equation (2).

  • ABS: Absolute Shapley value definition in Equation (3).

4.1 Performance on Iris dataset

The Iris flower dataset [5] is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher, consisting of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). The four features are sepal length, sepal width, petal length, and petal width, all in centimeters.

We employ two classic machine learning models, Logistic Regression (LR) and Support Vector Machine (SVM), to evaluate the effectiveness of the different Shapley value definitions. We first compute the Shapley value of each data tuple in the training dataset using the Monte Carlo method. We then train two predictive models from scratch, one on the top K training data tuples with the highest Shapley value and one on the top K training data tuples with the lowest Shapley value, respectively.
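The selection step can be sketched as a small ranking helper: sort the training tuples by their (approximate) Shapley value and take the K highest and K lowest indices. The helper name and K are our own; the paper does not fix a specific K in the text.

```python
def top_bottom_k(shapley_values, k):
    """Indices of the k tuples with the highest / lowest Shapley value."""
    order = sorted(range(len(shapley_values)), key=lambda i: shapley_values[i])
    return order[-k:], order[:k]  # (highest-k indices, lowest-k indices)

highest, lowest = top_bottom_k([0.3, -0.1, 0.5, 0.0], k=2)
```

The two index sets are then used to retrain the LR and SVM models from scratch on the corresponding subsets of the training data.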

The model accuracy is reported in Table 1. Surprisingly, for both LR and SVM, the accuracy of the model trained on the top K training data tuples with the highest ORI equals the accuracy of the model trained on the top K training data tuples with the lowest ORI. However, this phenomenon exactly validates our conjecture that "whether the effect of adding a new data tuple is positive or negative, as long as the effect is large enough, the newly added data tuple is significant to the trained model". Recall Figure 1: not only the data tuples in highest10 but also those in lowest10 can be support vectors. Furthermore, for both LR and SVM, the model trained on the top K training data tuples with the lowest ABS has the lowest accuracy, which verifies that the data tuples with the lowest ABS are truly unimportant. Recall Figure 3: the lowest10 data tuples lie in the middle of the dataset and are unimportant to the model accuracy. Therefore, ABS outperforms ORI and ZERO in terms of evaluating data importance.

       LR           LR          SVM          SVM
       (highestK)   (lowestK)   (highestK)   (lowestK)
ORI    100.00%      100.00%     93.33%       93.33%
ZERO   100.00%      63.33%      96.66%       90.00%
ABS    100.00%      60.00%      96.66%       90.00%
Table 1: Model accuracy (%) on Iris dataset.

5 Conclusion and Future Work

In this paper, for the first time, we define absolute Shapley value for evaluating data importance in training machine learning models. The experimental results of LR and SVM on the Iris dataset show that the absolute Shapley value definition outperforms the original Shapley value and zero Shapley value definitions in terms of evaluating data importance. For future work, we would like to explore the effectiveness of the different Shapley value definitions on more machine learning models and more representative datasets.

References

  • [1] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726. ACM, 2019.
  • [2] Marco Ancona, Cengiz Öztireli, and Markus H. Gross. Explaining deep neural networks with a polynomial time algorithm for Shapley value approximation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 272–281, 2019.
  • [3] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & OR, 36(5):1726–1730, 2009.
  • [4] Gavin C. Cawley and Nicola L. C. Talbot. Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognition, 36(11):2585–2592, 2003.
  • [5] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • [6] S. Shaheen Fatima, Michael J. Wooldridge, and Nicholas R. Jennings. A linear approximation method for the shapley value. Artif. Intell., 172(14):1673–1699, 2008.
  • [7] Amirata Ghorbani and James Y. Zou. Data shapley: Equitable valuation of data for machine learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 2242–2251, 2019.
  • [8] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019.
  • [9] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.