1 Introduction
Given a learning algorithm, evaluating the learning performance for any arbitrary input dataset traditionally requires training with the dataset and then calculating the performance of the trained model. The computational complexity of this evaluation process is often dominated by the training part, which could be very expensive. On the other hand, a variety of applications could potentially benefit from an efficient process of estimating the learning performance. Examples include noisy or adversarial data removal, data summarization (Wang et al., 2021b), and active learning (Wang et al., 2021a), which can all be done with the same recipe – evaluating the performance of many subsets of data points and only retraining the subset that achieves the highest learning performance.
One natural idea for estimating learning performance without training is to learn the mapping from a dataset to the resulting learning performance. We refer to this mapping as the data utility function hereinafter. Each training sample for learning a data utility function consists of a set of data points and the corresponding learning performance. While a learned data utility function allows efficient prediction of learning performance for any given input dataset, the training samples for learning it are expensive to construct, as each sample requires training on a dataset. Hence, there is a pressing need to understand the sample efficiency of this learning problem. Recent work shows promising empirical results: for many common learning algorithms, the data utility functions can be accurately learned with a relatively small number of samples (Wang et al., 2021a). The authors conjecture that the observed sample efficiency might be caused by the approximate submodularity of data utility functions (see Figure 1); specifically, in most cases, an extra training data point contributes less to learning performance as the training size increases. There is a rich line of work that studies sample complexity bounds for submodular functions (Balcan and Harvey, 2011; Gupta et al., 2013; Feldman and Vondrák, 2016; Feldman et al., 2020). However, there is still no understanding of (1) how to rigorously characterize functions with “approximate” submodularity and (2) how approximate submodularity affects learnability.
In this paper, we propose a relaxed submodularity condition that describes well the data utility functions of common learning algorithms. We show that the functions satisfying the “relaxed submodularity” condition are subsumed by a larger function class called self-bounding functions, which enjoy a dimension-free concentration bound (Boucheron et al., 2000). Leveraging the concentration bound, we show that these functions can be learned with constant-factor approximation error under the PMAC learning framework. Notably, our result applies to general self-bounding functions, which may be of independent interest. Furthermore, we show that self-bounding functions can be approximated by low-degree polynomials and then leverage this structural result to show that they can be learned under the PAC learning framework with a number of samples that scales logarithmically with the input dimension. Both the PMAC and PAC frameworks are widely used in analyzing the learnability of real-valued functions.
To showcase the usefulness of data utility function learning, we study its application to data valuation as a concrete example. The goal of data valuation is to quantify the usefulness of each data source for downstream analysis tasks. Data valuation has gained a lot of attention recently because it can inform the implementation of recent policies that aim to give individuals control over how their data is used and monetized by third parties (Voigt and Von dem Bussche, 2017). Moreover, the characterization of data value enables users to filter out poor-quality data and to identify data that are important to collect in the future (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020). Existing work leverages concepts from cooperative game theory, such as the Shapley value and least cores, as a fair notion of data value. However, computing or even approximating these data value notions requires evaluating the learning performance of many different combinations of data sources, which can be very computationally expensive. We propose to learn a data utility function and use it to predict the learning performance of a subset of data sources without retraining. We conduct extensive experiments and show that data utility learning can significantly improve the approximation accuracy of the Shapley value and least cores. In particular, our approach can be extended to many other cooperative games where the characteristic functions are close to submodular and the evaluation is expensive, such as bidders’ valuation functions in combinatorial auctions (Lehmann et al., 2006).
2 Related Work
Data utility learning was first proposed in Wang et al. (2021a), where the trained data utility models are used for active learning. Wang et al. (2021a) observe that most of the common data utility functions can be learned efficiently with a relatively small number of samples. They conjecture that the learnability of data utility functions is due to their “approximate” submodularity property. We note that the term “approximate submodularity” has different definitions in the literature (Horel and Singer, 2016; Hassidim and Singer, 2018; Das and Kempe, 2018; Chierichetti et al., 2020). However, these definitions do not consider special properties of data utility functions observed empirically; e.g., the deviation from submodularity may depend on the size difference between two datasets. More importantly, no existing work has studied the learnability of approximately submodular functions. This paper presents a rigorous characterization and a learnability result for the “relaxed” submodular functions that describe data utility functions well. Chierichetti et al. (2015) and Feige et al. (2020) present algorithms for learning the closest modular function via queries, which is orthogonal to our focus.
Self-bounding functions were first introduced by Boucheron et al. (2000). McDiarmid and Reed (2006) further refine them and introduce a more general notion of self-bounding functions. The most notable property of self-bounding functions is the dimension-free concentration bound, derived by the entropy method (Boucheron et al., 2000, 2003; McDiarmid and Reed, 2006; Boucheron et al., 2009). In particular, submodular functions are special instances of self-bounding functions. Similar concentration bounds for submodular functions were independently derived in Balcan and Harvey (2011) using Talagrand’s inequality, unaware of the connection with self-bounding functions. Our paper enriches the learnability study of self-bounding functions and explores the novel application of self-bounding functions to characterizing data utility functions.
Game-theoretic formulations of data valuation have become popular in recent years. In particular, the Shapley value has been widely used as a data value notion (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020), as it uniquely satisfies a set of desirable properties. Recently, Yan and Procaccia (2020) proposed the Least core as an alternative to the Shapley value for data valuation. However, exact computation of Shapley and Least core values is NP-hard in most settings, which limits their applicability in real-world data valuation applications even at the scale of hundreds of data points. Several heuristics, such as TMC-Shapley (Ghorbani and Zou, 2019), G-Shapley (Ghorbani and Zou, 2019), and KNN-Shapley (Jia et al., 2019a), have been proposed to approximate the Shapley value. Despite their computational advantage, they are biased by nature, limiting their applicability to sensitive applications such as designing monetary rewards for data sharing or assigning responsibility for ML decisions. On the other hand, unbiased estimators of the Shapley value such as Permutation Sampling (Maleki, 2015) and Group Testing (Jia et al., 2019b) still require a large number of learning performance evaluations for any decent approximation accuracy.
The technique proposed in this paper is not another approximation heuristic for the Shapley value or the Least core. Rather, it is a natural way to improve unbiased Shapley or Least core estimation, and it is compatible with any approximation heuristic that requires a significant number of data utility samples, such as permutation sampling (Maleki, 2015).
A similar idea of boosting Shapley value estimation by learning the characteristic function has been explored in cooperative game abstraction (CGA) (Yan et al., 2020). However, their technique may not be directly applicable to data valuation due to several issues. We provide a detailed comparison between CGA and our technique for data valuation in Section 5.
3 Technical Preliminaries
Since data utility functions are set functions, we start by reviewing important classes of set functions relevant to describing data utility functions.
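Set functions defined on the power set of $[n] = \{1, \dots, n\}$ can be equivalently viewed as functions defined on the binary hypercube $\{0,1\}^n$.
Definition 1.
A set function $f: 2^{[n]} \to \mathbb{R}$ is
monotone, if $f(S) \le f(T)$ for all $S \subseteq T \subseteq [n]$.
submodular, if $f(S \cup T) + f(S \cap T) \le f(S) + f(T)$ for all $S, T \subseteq [n]$.
Note that a submodular function is not necessarily monotone. An equivalent definition of submodularity is diminishing returns: for every $S \subseteq T \subseteq [n]$, and every $i \in [n] \setminus T$, a submodular function satisfies
(1) $f(S \cup \{i\}) - f(S) \ge f(T \cup \{i\}) - f(T)$.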
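As a quick sanity check on these definitions, the following sketch verifies monotonicity and the diminishing-returns form of submodularity by brute force on a small ground set. The helper names (`powerset`, `is_monotone`, `has_diminishing_returns`) and the two toy set functions are illustrative assumptions, not part of the original text. The check enumerates all subsets, so it is only feasible for a handful of elements.

```python
from itertools import combinations

def powerset(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone(f, ground):
    """True if adding any single element never decreases f (equivalent to f(S) <= f(T) for S in T)."""
    return all(f(S) <= f(S | {i}) + 1e-12
               for S in powerset(ground) for i in ground - S)

def has_diminishing_returns(f, ground):
    """True if the marginal gain of i at S is at least its gain at every superset T of S."""
    for T in powerset(ground):
        for S in powerset(T):
            for i in ground - T:
                if f(S | {i}) - f(S) < f(T | {i}) - f(T) - 1e-12:
                    return False
    return True

ground = frozenset(range(5))
sqrt_size = lambda S: len(S) ** 0.5    # concave in |S|: monotone and submodular
square_size = lambda S: len(S) ** 2    # convex in |S|: monotone but not submodular
```

Under these assumptions, `sqrt_size` passes both checks while `square_size` passes only the monotonicity check, which illustrates that the two properties are independent of each other.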
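Next, we introduce self-bounding functions. Let $e_i$ denote the one-hot encoding of $i$, i.e., $e_i$ is a vector of size $n$ with zeros on all but the $i$-th dimension. Let $\oplus$ denote the XOR operation.
Definition 2 (McDiarmid and Reed (2006)).
For a function $f: \{0,1\}^n \to \mathbb{R}$ and any $x \in \{0,1\}^n$, let $\min_{x_i} f(x) = \min\{f(x), f(x \oplus e_i)\}$. Then $f$ is $(a,b)$-self-bounding, if for every $x \in \{0,1\}^n$ and $i \in \{1, \dots, n\}$,
$f(x) - \min_{x_i} f(x) \le 1$ and $\sum_{i=1}^{n} \big( f(x) - \min_{x_i} f(x) \big) \le a f(x) + b$.
Self-bounding functions are the most general class of functions currently known to enjoy strong “dimension-free” (i.e., independent of $n$) concentration bounds.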
There are two commonly used analysis models for learning real-valued functions. One is the generalization of Valiant’s PAC learning model to real-valued functions (Valiant, 1984). Specifically, for any unknown function and target error , with non-negligible probability a PAC learner should output a hypothesis such that . However, this does not differentiate between the case of having low error on most of the distribution and high error on just a few points, versus the case of moderate error everywhere. A more demanding model is PMAC learning introduced by Balcan and Harvey (2011), where a learner has to output a hypothesis that multiplicatively approximates the target function. Specifically, a PMAC learner with approximation factor and error is an algorithm which outputs a hypothesis that satisfies . We say that multiplicatively approximates over distribution in this case. While in general both models do not make assumptions on the distribution, our analysis focuses on the fixed, uniform distribution over . In practice, for a data utility learning task, one typically has control over the sampling distribution of the training samples and can therefore set the sampling distribution to be uniform.
4 Learning Data Utility Functions
4.1 Characterizing Data Utility Functions
A data utility function maps a set of data points to a real number indicating the utility of the dataset. In the ML context, the utility of a dataset is typically measured by the performance of an ML model trained on the dataset, such as test accuracy. It has been empirically observed in prior work (Wang et al., 2021a) that many data utility functions can be efficiently learned with a relatively small number of samples. One conjecture for the cause of such efficient learnability is that data utility functions are close to submodular functions (e.g., Figure 1) for many common learning algorithms (Wang et al., 2021a).
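Table 1: Percentage of sampled utilities satisfying the submodularity condition (1) (first row) and the relaxed submodularity condition (2) (second and third rows).
Condition                  Synthetic+Logistic  USPS+SVM  IRIS+Logistic  IRIS+SVM  MNIST+Logistic  MNIST+CNN  CIFAR10+CNN
Submodularity (1)          68.15%              66.15%    94.1%          92.5%     50.95%          50.5%      50.8%
Relaxed submodularity (2)  99%                 98.7%     99.85%         99.4%     66.4%           66.85%     79.6%
Relaxed submodularity (2)  100%                100%      100%           100%      97.9%           95.8%      99.6%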
We first check how well submodularity describes the properties of common data utility functions. Specifically, we uniformly sample some subsets from training data and evaluate their corresponding utility. We then calculate the percentage of these utility samples satisfying the submodularity condition in (1). The result is provided in the first row of Table 1. It shows that submodularity is an overly stringent condition to describe data utility functions because for many learning algorithms and datasets, the majority of utility samples do not satisfy it. Therefore, we propose a refined condition for modeling common data utility functions.
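The measurement described above could be sketched roughly as follows. The sampling scheme (a uniform subset `T` that excludes a random point `i`, then a uniform subset `S` of `T`), the function names, and the toy noisy utility are assumptions made for illustration; the actual protocol and utilities (test accuracy of trained models) may differ.

```python
import random

def satisfaction_rate(utility, n, num_samples=2000, seed=0):
    """Fraction of sampled (S, T, i), with S a subset of T and i outside T, for which
    utility(S + i) - utility(S) >= utility(T + i) - utility(T), i.e. condition (1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        i = rng.randrange(n)
        # T is uniform among subsets that exclude i; S is uniform among subsets of T.
        T = frozenset(j for j in range(n) if j != i and rng.random() < 0.5)
        S = frozenset(j for j in T if rng.random() < 0.5)
        gain_S = utility(S | {i}) - utility(S)
        gain_T = utility(T | {i}) - utility(T)
        hits += gain_S >= gain_T
    return hits / num_samples

def make_noisy_utility(noise, seed=1):
    """Hypothetical stand-in for test accuracy: a saturating curve plus per-subset noise."""
    rng = random.Random(seed)
    cache = {}
    def utility(S):
        if S not in cache:  # each subset gets one fixed noisy value
            cache[S] = 1 - 0.5 ** len(S) + rng.uniform(-noise, noise)
        return cache[S]
    return utility

clean_rate = satisfaction_rate(make_noisy_utility(0.0), n=12)
noisy_rate = satisfaction_rate(make_noisy_utility(0.05), n=12)
```

In this toy setting the noiseless utility satisfies the condition on every sample, while even a small amount of noise causes many violations, which is one plausible reading of why exact submodularity fits real utility samples poorly.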
Definition 3 (relaxed submodularity).
We say that a set function satisfies relaxed submodularity if for every , and every ,
(2) 
In particular, when , the bias term attempts to model the phenomenon that when the datasets are small, the contribution of an additional data point has larger variance. Hence, as two datasets differ more in size, the contributions to the two sets might deviate more from the exact submodularity property. When , this condition reduces to the exact submodularity definition in (1). Table 1 shows that this relaxed condition better aligns with the properties of data utility functions in practice. The following theorem shows that any set function with range that satisfies the relaxed submodularity condition is self-bounding.
Theorem 1.
If a function satisfies for all and , then is self-bounding.
Note that we can easily transform a data utility function to range through normalization, as data utility functions typically have fixed ranges, e.g., classification accuracy in percentage always lies between 0 and 100. Hence, the learnability problem for data utility functions can be reduced to that for self-bounding functions.
4.2 Learning Selfbounding Functions
Existing works on learning self-bounding functions either make the extra assumption of monotonicity (Feldman et al., 2020) or focus exclusively on self-bounding functions (Feldman and Vondrák, 2016). As shown in Figure 1, data utility functions in general are not monotone. Moreover, as shown in Table 1, they can be better characterized by self-bounding functions with nonzero . Hence, the learnability results in prior works cannot be directly applied to data utility functions. In this work, we extend these results to a more general case of self-bounding functions and relax the monotonicity constraint.
Our first result shows that under the PMAC framework, self-bounding functions can be learned with constant-factor errors. Formally, we show the following result.
Theorem 2.
There exists an algorithm that, given access to uniformly random examples of a self-bounding function where , with probability at least , outputs a function which is a multiplicative approximation of over the uniform distribution, where . Further, runs in time and uses examples.
The intuition of the proof is inspired by Balcan and Harvey (2011) and goes as follows. Since the value of any self-bounding function is tightly concentrated around its expectation under the uniform distribution, the constant function equal to the empirical mean gives a good approximation to . This result indicates that when we have access to random samples of a self-bounding function, it is possible to learn it up to a constant multiplicative error with constant sample complexity and runtime. Note that in this result, there is a trade-off between the approximation factor and the error term, as a constant function cannot approximate every self-bounding function arbitrarily well.
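The constant-hypothesis idea can be illustrated with a small sketch. Everything below is an assumption for illustration: the toy target `f` (square root of the number of ones, which concentrates around its mean), the sample sizes, and the symmetric criterion `h(x)/alpha <= f(x) <= alpha*h(x)`, which is a simplification and not necessarily the exact PMAC condition used in the theorem.

```python
import random

def learn_constant(samples):
    """Return the constant hypothesis equal to the empirical mean of the observed values."""
    mean = sum(y for _, y in samples) / len(samples)
    return lambda x: mean

def multiplicative_coverage(h, f, points, alpha):
    """Fraction of points x with h(x)/alpha <= f(x) <= alpha * h(x)."""
    ok = sum(h(x) / alpha <= f(x) <= alpha * h(x) for x in points)
    return ok / len(points)

rng = random.Random(0)
n = 100
f = lambda x: sum(x) ** 0.5                                  # toy concentrated target
draw = lambda: tuple(rng.randint(0, 1) for _ in range(n))    # uniform point of {0,1}^n

train = [(x, f(x)) for x in (draw() for _ in range(200))]
h = learn_constant(train)
test_points = [draw() for _ in range(2000)]
coverage = multiplicative_coverage(h, f, test_points, alpha=1.2)
```

Shrinking `alpha` toward 1 lowers `coverage` in this toy example, which mirrors the trade-off between approximation factor and error noted above.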
Our second learnability result is derived under the less demanding PAC learning framework. We first extend the structural results in Feldman et al. (2020) for self-bounding functions to the more general setting of self-bounding functions. Specifically, we find that any self-bounding function can be approximated by low-degree polynomials in norm.
Theorem 3.
Let be a self-bounding function and . For there exists a set of indices I of size and a polynomial of degree d over variables in such that .
Based on this structural result, we give a learnability result for self-bounding functions in a distribution-specific PAC learning model.
Theorem 4.
Let be the class of all self-bounding functions from to . Let . There exists an algorithm that, given and access to uniformly random examples of any target function , with probability at least , outputs a hypothesis , s.t. . Further, runs in time and uses examples.
The learning algorithm used to prove this theorem is based on polynomial regression over all monomials of degree , inspired by Feldman and Vondrák (2016) and Feldman et al. (2020). Unlike in Theorem 2, the approximation error can be made arbitrarily small. Moreover, the sample complexity scales logarithmically with the input dimension , which could still be quite efficient.
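To make the idea of fitting low-degree monomials concrete, here is a minimal sketch of a simpler stand-in: the classical low-degree algorithm, which estimates the Fourier coefficient of every parity of size at most `degree` by an empirical average under the uniform distribution. This is not the polynomial regression procedure of the cited works; the function names, the toy target, and the sample sizes are assumptions for illustration.

```python
import random
from itertools import combinations

def parity(x, subset):
    """Character chi_S(x): +1 if x has an even number of ones on `subset`, else -1."""
    return -1 if sum(x[i] for i in subset) % 2 else 1

def fit_low_degree(samples, n, degree):
    """Estimate the coefficient of every parity on at most `degree` coordinates."""
    coeffs = {}
    for r in range(degree + 1):
        for subset in combinations(range(n), r):
            coeffs[subset] = sum(y * parity(x, subset) for x, y in samples) / len(samples)
    return coeffs

def predict(coeffs, x):
    """Evaluate the fitted low-degree polynomial at x."""
    return sum(c * parity(x, subset) for subset, c in coeffs.items())

rng = random.Random(0)
n = 6
target = lambda x: 0.5 * x[0] + 0.25 * x[1] * x[2]           # toy degree-2 target
draw = lambda: tuple(rng.randint(0, 1) for _ in range(n))    # uniform point of {0,1}^n

train = [(x, target(x)) for x in (draw() for _ in range(4000))]
coeffs = fit_low_degree(train, n, degree=2)
test_points = [draw() for _ in range(500)]
mean_abs_error = sum(abs(predict(coeffs, x) - target(x)) for x in test_points) / len(test_points)
```

The number of fitted coefficients grows as roughly `n ** degree`, so keeping the degree low is what makes this kind of approach tractable as the input dimension increases.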
The practical implication of these learnability results is that data utility functions could potentially be learned efficiently from a limited number of random samples of data utilities.
5 Applying Data Utility Learning to Data Valuation
As a concrete example to demonstrate the usefulness of data utility learning, we explore a novel application in data valuation.
The Shapley value is a widely used data value notion nowadays (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020). Recent work has also proposed to use the Least core (Yan and Procaccia, 2020), another famous solution concept in cooperative game theory, as a viable alternative to the Shapley value. (The Least core may not be unique. In this paper, when we talk about the least core, we always refer to the least core vector that has the smallest norm, following the tie-breaking rule in the original literature (Yan and Procaccia, 2020).) Both data value notions have rigorous fairness guarantees, which makes them particularly attractive in sensitive applications involving assigning monetary rewards or attributing responsibility based on data value. Given data points and a data utility function , the definitions of the Shapley value and the Least core are respectively given as follows:
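For reference, the two solution concepts are commonly written in the following standard textbook forms. The notation is assumed here rather than taken from the original: $N = \{1, \dots, n\}$ indexes the data points, $v$ is the data utility function, $\phi_i$ is the Shapley value of point $i$, and $(x, e)$ are the least core allocation and its slack.

```latex
% Shapley value of data point i
\phi_i(v) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
            \binom{n-1}{|S|}^{-1} \bigl[ v(S \cup \{i\}) - v(S) \bigr]

% Least core: the allocation x that solves
\min_{e,\, x} \; e
\quad \text{s.t.} \quad \sum_{i \in N} x_i = v(N), \qquad
\sum_{i \in S} x_i + e \ge v(S) \quad \forall\, S \subseteq N
```

Both expressions range over all subsets of $N$, which is where the exponential evaluation cost comes from.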
It can be seen from these definitions that the exact calculation of both the Shapley value and the least core requires evaluating the data utility function on every possible subset of the training data, the number of which grows exponentially with the number of data points. Existing works on data valuation have largely focused on making the calculation more efficient. The main idea underlying the existing works is to evaluate the data utility only on some sampled subsets and then estimate the data value based on the utility samples. One key aspect ignored by all the prior works is that, with the utility samples, we can potentially learn a model of the data utility function, which can in turn be used to predict the utility of subsets that are not sampled.
This paper investigates the potential of using data utility learning to further improve the efficiency of various existing data value approximation heuristics. We first build an abstraction for data value approximation heuristics, which covers all the existing unbiased heuristics (Maleki, 2015; Jia et al., 2019b; Yan and Procaccia, 2020) as well as some of the biased heuristics involving data utility sampling (Ghorbani and Zou, 2019). These heuristics consist of two components: a sampler and an estimator . The heuristic sampler takes a dataset and a sampling budget as input, and outputs a sample set of data utilities where each . The heuristic estimator then takes the sampled utility set and computes the estimate of the corresponding solution concept (i.e., the Shapley value or the Least core). Algorithm 1 summarizes our algorithm for accelerating data valuation with data utility learning. Our algorithm leverages the utility sample set already available in existing heuristics and uses it for data utility learning (denoted as ). Once we obtain the learned data utility model, we can use it to predict the utilities of more data subsets. Finally, we use the combination of the true utility samples and predicted utility samples as input to . Since querying the utility model is very efficient, the additional predicted utility samples come at almost no extra cost. In practice, the evaluation budget could be set much greater than the training budget in Algorithm 1.
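The sampler, learner, and estimator pipeline described above might look roughly like the sketch below. It is not the paper's Algorithm 1: the sampler and estimator follow plain permutation sampling, the utility model is a hypothetical linear fit trained by gradient descent (the experiments use an MLP), and all names and budgets are assumptions for illustration.

```python
import random

def sample_permutations(n, num_perms, rng):
    """Heuristic sampler: random orderings of the n data points."""
    return [rng.sample(range(n), n) for _ in range(num_perms)]

def permutation_estimate(perms, utility, n):
    """Heuristic estimator: average marginal gain of each point along the permutations."""
    values = [0.0] * n
    for perm in perms:
        prefix, prev = frozenset(), utility(frozenset())
        for i in perm:
            prefix = prefix | {i}
            cur = utility(prefix)
            values[i] += (cur - prev) / len(perms)
            prev = cur
    return values

def fit_linear(samples, n, epochs=200, lr=0.05):
    """Hypothetical utility model: linear in the subset indicator, fit by gradient descent."""
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for S, y in samples:
            err = b + sum(w[i] for i in S) - y
            b -= lr * err
            for i in S:
                w[i] -= lr * err
    return lambda S: b + sum(w[i] for i in S)

def accelerated_shapley(true_utility, n, train_perms, eval_perms, seed=0):
    rng = random.Random(seed)
    cache = {}                                   # true utility samples
    def record(S):
        if S not in cache:
            cache[S] = true_utility(S)           # expensive step: retraining a model
        return cache[S]
    permutation_estimate(sample_permutations(n, train_perms, rng), record, n)
    model = fit_linear(list(cache.items()), n)   # learned data utility function
    def mixed(S):                                # true value if sampled, prediction otherwise
        return cache[S] if S in cache else model(S)
    return permutation_estimate(sample_permutations(n, eval_perms, rng), mixed, n)

toy_utility = lambda S: 1 - 0.8 ** len(S)        # stand-in for test accuracy
values = accelerated_shapley(toy_utility, n=8, train_perms=5, eval_perms=200)
```

Only the `train_perms` permutations trigger calls to the expensive utility; the much larger `eval_perms` budget is served almost entirely by the cheap model.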
Comparison with Cooperative Game Abstraction.
Yan et al. (2020) propose a similar idea, which approximates the Shapley value by learning the characteristic function of a cooperative game with a parametric model called cooperative game abstraction (CGA). CGA is essentially a linear regression model where the variables are small subsets of players. The order of CGA refers to the largest size of player groups included as variables in the linear regression model. One advantage of CGA is that it can recover the Shapley value directly from its parameters if the function of CGA perfectly matches the characteristic function. Hence, Yan et al. (2020) propose to learn the characteristic function using CGA with a certain number of samples, and then compute the Shapley value from the trained parameters of CGA. For the data valuation problem, the characteristic function is the data utility function. In this sense, the technique of Yan et al. (2020) can be viewed as a special case of data utility learning where the data utility model is a linear regression. However, we argue that CGA may not be a suitable model for data utility learning, for two main reasons: (i) CGA is only suitable for certain types of games where interactions only exist within small groups of players, e.g., the team performance in basketball games. In contrast, interactions between large groups of data points still exist and might be strong. Thus, CGA is by nature not a suitable model for data utility learning, and we confirm this point in Section 6. (ii) CGA scales poorly even at low order, e.g., for a third-order CGA with 2000 players, the total number of parameters is , while the total number of parameters of a fully-connected neural network is only around half of it. We also note that the parameters of CGA only have a closed form for computing the Shapley value, but not the Least core.
6 Evaluation
6.1 Evaluation Settings
Protocol.
We evaluate the effectiveness of data utility learning for boosting the performance of the Shapley and Least core estimation heuristics. We first assess the performance of Shapley value and Least core estimators on datasets small enough that we can directly calculate the true data value and then evaluate the estimation error of different heuristics. For larger datasets, since it is impractical to compute the exact data value, we compare the performance of data value estimates on a data removal task, following Ghorbani and Zou (2019); Jia et al. (2019a, c); Wang et al. (2020); Yan and Procaccia (2020). In addition, we evaluate the performance on data group valuation, where the Shapley or least core values are assigned to a group of data points rather than a single point. For all experiments, we train the data utility model using a three-layer MLP model.
Baselines.
For Shapley value estimation, we consider three baselines: the two existing unbiased estimators, namely (1) Permutation Sampling (Perm) (Maleki, 2015), which is a Monte Carlo algorithm, and (2) Group Testing (GT) (Jia et al., 2019b), which is an improved Monte Carlo algorithm based on group testing theory; and (3) CGA (Yan et al., 2020), which improves the efficiency of Shapley value estimation by using a linear combination of the utilities on small subsets to estimate the utility function. For the least core estimation, the baseline is the Monte Carlo (MC) approach (Yan and Procaccia, 2020), which is the only known unbiased estimator.
We defer the implementation details of data utility learning as well as the baseline approaches to the supplementary materials. For every experiment we conduct, we repeat each heuristic computation 10 times to obtain the error bars.
6.2 Results
Error Simulation. We first test the performance of different Shapley value or least core estimation heuristics on tiny datasets with at most 15 training data points. In this case, it becomes computationally feasible to compute the exact Shapley and least core values, and thus we can calculate the estimation error of different heuristics. We experiment on both a synthetic and a natural dataset. To generate the synthetic dataset, we sample 10 data points from a 2-dimensional Gaussian distribution whose parameters are and , respectively. The labels are assigned based on the sign of the sum of the features. A logistic regression classifier trained on the 10 data points achieves around 80% test accuracy. For the natural dataset, we randomly sample 15 data points from the well-known Iris dataset (Pedregosa et al., 2011). A support-vector machine (SVM) classifier trained on the 15 data points achieves around 94% test accuracy. After training the utility model, we obtain the estimated data utilities for all the subsets not sampled before. We then estimate the Shapley and Least core values using the exact Shapley and Least core calculation formulas, with the estimated data utilities.
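For datasets this small, the exact Shapley computation over all subsets is straightforward. The sketch below is an illustrative implementation of the standard Shapley formula, not the authors' code; `exact_shapley` and the toy three-player utility are assumed names. In the setting above, `utility` would return the true value for sampled subsets and the model's prediction for the rest.

```python
from itertools import combinations
from math import comb

def exact_shapley(utility, n):
    """Shapley value of each of n points, using the utility of all 2^n subsets."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            # Every subset of size r that excludes i carries the same weight.
            weight = 1.0 / (n * comb(n - 1, r))
            for combo in combinations(others, r):
                S = frozenset(combo)
                total += weight * (utility(S | {i}) - utility(S))
        values.append(total)
    return values

# Toy utility: the coalition is worth 1 only if it contains both point 0 and point 1.
pair_utility = lambda S: 1.0 if {0, 1} <= S else 0.0
shapley = exact_shapley(pair_utility, n=3)
```

In the toy game, points 0 and 1 split the value equally and point 2 receives nothing. The cost is on the order of `n * 2 ** (n - 1)` utility evaluations, which is why this route is limited to roughly 15 points.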
We show estimation errors in Figure 2, and defer the results for and errors to the supplementary materials. As we can see, with a relatively small number of sampled utilities (e.g., 200 to 300 for the synthetic dataset and 500 for the Iris dataset), data utility learning can significantly reduce the estimation errors for both the Shapley and Least core values. Utility prediction per se may introduce additional computational costs; yet, these costs are often negligible compared to model retraining. Moreover, CGA-based Shapley estimation performs poorly, because there are high-order data interactions in data utility functions which CGA cannot successfully capture.
Data Removal. We evaluate the Shapley/Least core value estimations on larger datasets by comparing their performance on a data removal task. Specifically, we remove the most (least) valuable portion of the dataset and see how the utility of the remaining dataset changes. Intuitively, a better data value estimate can better identify the importance of data points. Hence, when the data with the highest (lowest) value estimates are removed, a better data value estimation method would lead to a faster (slower) performance drop. Similar to the previous experiment, we experiment on both synthetic and real-world datasets. For the synthetic data generation, we sample 200 data points from a 50-dimensional Gaussian distribution, where the 50-dimensional parameters are sampled uniformly from , and each data point is labeled by the sign of the sum of the data point vector. The utility of the dataset is defined by the test accuracy of a logistic regression classifier trained on the dataset. For the real-world data experiment, we select 2000 data points from the PubFig83 (Pinto et al., 2011) dataset, and the utility refers to the Top-5 accuracy of a convolutional neural network trained on it (for facial recognition). Since the error simulation experiment showed that CGA does not perform well in estimating Shapley values for data valuation, and since CGA is not scalable to larger datasets as discussed in Section 5, we do not include it as a baseline for this experiment.
We experiment with different training budgets , and we set the evaluation budget as utility model evaluation is much faster than retraining the model. We show the result of for synthetic data and for PubFig83. This is clearly a low-resource setting, as computing the exact Shapley value or least core requires times of training on synthetic data and times of training on PubFig83 data. As can be seen in Figure 3 (a)(b) and (d)(e), the estimation heuristics equipped with data utility learning consistently perform better at identifying the most and least valuable data points. This means that the Shapley and Least core values estimated with predicted utilities are at least more effective in predicting the most and least valuable (in a sense) data points in these settings. As a side note, the Shapley value estimated by Permutation Sampling is superior to the Least core estimated by the Monte Carlo algorithm, which does not agree with the experimental results in Yan and Procaccia (2020). Therefore, an interesting future direction is to better understand the cases where the Shapley value performs better than the least core and vice versa.
Group Data Removal. We also experiment on estimating Shapley and Least core values for groups of data points. This is a potentially more realistic and useful setting, since in practice more than one data record is typically collected from each party. We divide the Adult dataset (Dua and Graff, 2017) into 200 groups of varying size. The proportion of data points allocated to each group is sampled from a Dirichlet distribution where has 30 in all dimensions. This design ensures a moderate amount of variation in group sizes. A utility sample in this setting refers to the performance of a logistic regression model trained on the data points provided by a coalition of groups. This is by nature a more challenging setting for data utility learning, since the “diminishing return” property of the data utility function may be violated due to the variation in group sizes.
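The group-size sampling described above can be sketched as follows. The Dirichlet draw is implemented by normalising Gamma variates, a standard construction; the total of 10000 points is a hypothetical placeholder rather than the number used in the experiments, and the rounding rule (largest remainder) is an assumption added so that the integer sizes sum exactly to the total.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def group_sizes(num_points, proportions):
    """Convert proportions into integer sizes summing to num_points (largest remainder)."""
    raw = [p * num_points for p in proportions]
    sizes = [int(r) for r in raw]
    by_remainder = sorted(range(len(raw)), key=lambda k: raw[k] - sizes[k], reverse=True)
    for k in by_remainder[:num_points - sum(sizes)]:
        sizes[k] += 1
    return sizes

rng = random.Random(0)
proportions = dirichlet([30.0] * 200, rng)   # 200 groups, concentration 30 in every dimension
sizes = group_sizes(10000, proportions)      # hypothetical total number of data points
```

A larger concentration value would make the groups closer to equal in size, while a smaller one would spread the sizes further apart.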
Since the Adult dataset is highly unbalanced, we use the F1-score as the utility metric. We show the results for here and defer the other settings to the supplementary materials. As we can see from Figure 3 (c) and (f), the heuristics equipped with data utility learning again perform favorably for both Shapley and Least core estimation.
7 Conclusion
This paper presents the first learnability analysis for data utility functions. We propose a relaxed submodularity notion that describes well the properties of data utility functions for popular ML models. We show that relaxed submodular functions belong to the class of self-bounding functions and then characterize the bounds on learning error under both the PMAC and PAC learning models. Finally, we study the application of data utility learning in data valuation and propose a generic framework that can significantly improve the accuracy of all existing unbiased Shapley/least core estimation methods.
Limitations & Future Work. It is worth noting that many settings studied in our experiments in Section 6 go well beyond the scope of our theoretical learnability results in Section 4.2. Specifically, the theoretical results are derived for simple learning algorithms such as regression and focus on a uniform utility sample distribution. Meanwhile, our empirical results suggest that learning data utility functions could be data-efficient with neural network-based models, non-uniform utility sample distributions, and even functions not so close to being submodular. Closing these gaps between theory and practice would be important future work. With this paper, we move one step closer towards a rigorous understanding of data utility learning and hope to inspire more research in this direction in the ML community.
References

[1]
(2011)
Learning submodular functions.
In
Proceedings of the fortythird annual ACM symposium on Theory of computing
, pp. 793–802. Cited by: §1, §2, §3, §4.2.  [2] (2003) Concentration inequalities using the entropy method. Annals of probability 31 (3), pp. 1583–1614. Cited by: §2.
 [3] (2009) On concentration of selfbounding functions. Electronic Journal of Probability 14, pp. 1884–1899. Cited by: §2, Theorem 6.
 [4] (2000) A sharp concentration inequality with applications. Random Structures & Algorithms 16 (3), pp. 277–292. Cited by: §1, §2.
 [5] (2015) Approximate modularity. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1143–1162. Cited by: §2.
 [6] (202010) On Additive Approximate Submodularity. arXiv eprints, pp. arXiv:2010.02912. External Links: 2010.02912 Cited by: §2.
 [7] (2018) Approximate submodularity and its applications: subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research 19 (1), pp. 74–107. Cited by: §2.
 [8] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §6.2.
 [9] (2020) Approximate modularity revisited. SIAM Journal on Computing 49 (1), pp. 67–97. Cited by: §2.
 [10] (2020) Tight bounds on l1 approximation and learning of self-bounding functions. Theoretical Computer Science 808, pp. 86–98. Cited by: Appendix C, Appendix C, Appendix C, §1, §4.2, §4.2, §4.2, Definition 4, Theorem 8, Theorem 9.
 [11] (2016) Optimal bounds on approximation of submodular and xos functions by juntas. SIAM Journal on Computing 45 (3), pp. 1129–1170. Cited by: Appendix C, Appendix C, Appendix C, §1, §4.2, §4.2, Definition 4.
 [12] (1998) Boolean functions with low average sensitivity depend on few coordinates. Combinatorica 18 (1), pp. 27–35. Cited by: Appendix C.
 [13] (2019) Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §1, §2, §5, §5, §6.1.
 [14] (2013) Privately releasing conjunctions and the statistical query barrier. SIAM Journal on Computing 42 (4), pp. 1494–1520. Cited by: §1.
 [15] (2018) Optimization for approximate submodularity. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 394–405. Cited by: §2.
 [16] (2016) Maximization of approximately submodular functions.. In NIPS, Vol. 16, pp. 3045–3053. Cited by: §2.
 [17] (2019) Efficient task-specific data valuation for nearest neighbor algorithms. arXiv preprint arXiv:1908.08619. Cited by: §1, §2, §5, §6.1.

 [18] (2019) Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. Cited by: §1, §2, §5, §5, §6.1.
 [19] (2019) Scalability vs. utility: do we have to sacrifice one for the other in data importance quantification?. arXiv preprint arXiv:1911.07128. Cited by: §1, §2, §5, §6.1.
 [20] (2006) Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior 55 (2), pp. 270–296. Cited by: §1.
 [21] (2015) Addressing the computational issues of the shapley value with applications in the smart grid. Ph.D. Thesis, University of Southampton. Cited by: §2, §2, §5, §6.1.
 [22] (2006) Concentration for self-bounding functions and an inequality of talagrand. Random Structures & Algorithms 29 (4), pp. 549–557. Cited by: §2, Definition 2.
 [23] (2011) Scikit-learn: machine learning in python. the Journal of machine Learning research 12, pp. 2825–2830. Cited by: §6.2.

 [24] (2011) Scaling up biologically-inspired computer vision: a case study in unconstrained face recognition on facebook. In CVPR 2011 WORKSHOPS, pp. 35–42. Cited by: §6.2.
 [25] (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §D.1.
 [26] (1984) A theory of the learnable. Communications of the ACM 27 (11), pp. 1134–1142. Cited by: §3.
 [27] (2017) The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing 10, pp. 3152676. Cited by: §1.
 [28] (2021) One-round active learning. arXiv preprint arXiv:2104.11843. Cited by: Figure 1, §1, §1, §2, §4.1, Table 1.
 [29] (2020) A principled approach to data valuation for federated learning. In Federated Learning, pp. 153–167. Cited by: §1, §2, §5, §6.1.
 [30] (2021) A unified framework for task-driven data quality management. arXiv preprint arXiv:2106.05484. Cited by: §1.
 [31] (2020) Evaluating and rewarding teamwork using cooperative game abstractions. arXiv preprint arXiv:2006.09538. Cited by: §2, §5, §6.1.
 [32] (2020) If you like shapley then you’ll love the core. Manuscript. Cited by: §2, §5, §5, §6.1, §6.1, §6.2, footnote 1.
Appendix A Proof for Theorem 1
Theorem 1.
Every function satisfies
for all and , then is self-bounding.
Proof.
Since the range of function lies in , the requirement that is trivially satisfied. Our goal is to show that, for every and every , we have
(3) 
Given an input , denote as the vector that sets the th coordinate to the binary bit , and denote as the vector that sets all coordinates in to . Let be the set of indices where , and let be the set of indices where .
WLOG, let . By the relaxed submodularity condition, we have
The last inequality is due to .
Similarly, let where , we have
Therefore we have
∎
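To make the self-bounding property concrete, the following Python sketch brute-forces the condition on a small hypercube. It assumes the common hypercube form of the definition, namely that for every input the sum over coordinates of f(x) minus the minimum of f over the two values of that coordinate is at most a·f(x) + b. The helper names and the toy coverage utility are hypothetical illustrations and are not taken from the paper.

```python
from itertools import product


def coordinate_min(f, x, i):
    """Minimum of f over both values of coordinate i, other coordinates fixed."""
    lo = x[:i] + (0,) + x[i + 1:]
    hi = x[:i] + (1,) + x[i + 1:]
    return min(f(lo), f(hi))


def is_self_bounding(f, n, a, b=0.0, tol=1e-12):
    """Check sum_i (f(x) - min_{x_i} f(x)) <= a * f(x) + b on all of {0,1}^n.

    Assumed form of the definition; exponential in n, so only for tiny n.
    """
    for x in product((0, 1), repeat=n):
        fx = f(x)
        total_drop = sum(fx - coordinate_min(f, x, i) for i in range(n))
        if total_drop > a * fx + b + tol:
            return False
    return True


# Hypothetical toy utility: normalized coverage of a 4-element universe.
SETS = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]


def coverage(x):
    covered = set()
    for i, bit in enumerate(x):
        if bit:
            covered |= SETS[i]
    return len(covered) / 4.0


print(is_self_bounding(coverage, 4, a=1.0))
print(is_self_bounding(coverage, 4, a=0.5))
```

The coverage utility is monotone submodular with value zero on the empty set, so the check passes with a = 1 and fails with a = 0.5 (a single selected point already loses its whole utility when removed).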
Appendix B Proof of Theorem 2
In order to prove Theorem 2, we will need the following results.
Theorem 5 (Hoeffding Bound).
Let
be independent random variables. Assume that
for Then, for the empirical mean of these variables we have the inequalities
Theorem 6 (Concentration Bound for Self-bounding Functions [3]).
If where are independent random variables and is self-bounding, and then

for any ;

for any .
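For reference, the Hoeffding bound invoked in Theorem 5 is commonly stated as follows. This is the standard textbook form, not a recovery of the paper's own display; the interval endpoints $a_i, b_i$ and the deviation $t$ are notation assumed here.

```latex
% Standard Hoeffding bound; a_i, b_i and t are notation assumed for this sketch.
Let $X_1,\dots,X_n$ be independent random variables with $X_i \in [a_i, b_i]$
almost surely, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for every $t > 0$,
\begin{align*}
\Pr\left[\bar{X} - \mathbb{E}[\bar{X}] \ge t\right]
  &\le \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right), \\
\Pr\left[\bar{X} - \mathbb{E}[\bar{X}] \le -t\right]
  &\le \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).
\end{align*}
```

When every variable lies in $[0,1]$, both right-hand sides reduce to $\exp(-2 n t^2)$.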
Now we prove Theorem 2.
Theorem 2.
There exists an algorithm that, given access to random uniform examples of an self-bounding function where , with probability at least , outputs a function which is a multiplicative approximation of over the uniform distribution, where . Further, runs in time and uses examples.
Proof.
We show that Algorithm 2 outputs a function which achieves the desired multiplicative approximation to any self-bounding function with probability at least .
Case 1: suppose the empirical mean , then , which leads to . Therefore
(5)  
where (5) is due to . Therefore, with probability at least , Algorithm 2 achieves approximation factor of on all but fraction of the distribution.
Case 2: suppose the empirical mean . In this case, we know that . Since we know that for all , we only need to bound the probability that a sample drawn from the uniform distribution violates
, which is calculated as
This concludes the proof of Theorem 2. ∎
Appendix C Proof of Theorem 3 and 4
Our structural result in Theorem 3 builds on the work of [11] and [10]. [10] shows that every real-valued function with low total influence can be approximated by a low-degree polynomial. [11] and [10] also extend the result in [12] on the approximability of real-valued functions by functions of a small number of variables (referred to as juntas in the original work), where the bound on the number of variables also depends on the influence of the target function. Thus, the key idea in proving Theorem 3 is to show that every self-bounding has low total influence.
We first formally define the influence of a function defined on . Given an input , denote as the vector that sets the th coordinate to the binary bit . For a function and index , we define . The norm of is defined by where is the uniform distribution.
Definition 4 (Influences [11, 10]).
For a function , , we define the influence of variable as . We define and refer to it as the total influence of .
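As an illustration of Definition 4, the sketch below computes a total influence by brute force over a small hypercube. It assumes one common convention, in which the influence of variable i is the average over uniform inputs of the absolute difference between f with the i-th bit set to 1 and to 0; the exact norm and normalization used in the source are not recoverable from the text here, and the function names are hypothetical.

```python
from itertools import product


def variable_influence(f, n, i):
    """Average of |f(x with bit i = 1) - f(x with bit i = 0)| over uniform x.

    Assumed l1-style convention; other normalizations differ by constants.
    """
    points = list(product((0, 1), repeat=n))
    total = 0.0
    for x in points:
        hi = x[:i] + (1,) + x[i + 1:]
        lo = x[:i] + (0,) + x[i + 1:]
        total += abs(f(hi) - f(lo))
    return total / len(points)


def total_influence(f, n):
    """Sum of the per-variable influences."""
    return sum(variable_influence(f, n, i) for i in range(n))


def mean_bits(x):
    """Toy utility: fraction of selected points."""
    return sum(x) / len(x)


def first_bit(x):
    """Toy utility that depends on a single variable."""
    return float(x[0])


print(total_influence(mean_bits, 4))
print(total_influence(first_bit, 4))
```

Under this convention a function that depends on only one variable and a function that spreads the same total change evenly over all variables have the same total influence, which is why the junta bound depends on the total influence rather than on the number of variables.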
We now show that every self-bounding function has total influence of at most . The proof follows Lemma 4.2 in [11].
Lemma 7.
Let function be an self-bounding function, then .
Proof.
By the definition of total influence, we have
(6)  
(7)  
where (7) holds because each difference is counted twice in (6) but only once in (7).
Further, we notice that
By using the property of self-bounding functions, we can then upper bound by
∎
The following structural result has been proved in [10].
Theorem 8 ([10]).
Let be a function and . For every , let and for Then and there exists a polynomial of degree over variables in I such that .
Theorem 3 can thus be obtained by plugging in the total influence of self-bounding functions .
The learnability result in Theorem 4 is a simple extension of Theorem 1.2 in [10] and Theorem 7.5 in [11]. Namely, as our goal is to learn an self-bounding function, there is no inductive bias if we choose the hypothesis class as all functions with low total influence. As in [11] and [10], the learning algorithm used to prove the result is polynomial regression over all monomials of degree .
Theorem 9 (Extension of Theorem 1.2 in [10]).
Let be the class of all functions to with total influence of at most . There exists an algorithm that, given and access to random uniform examples of any target function , with probability at least , outputs a hypothesis , s.t. . Further, runs in time and uses examples.
Theorem 4 immediately follows from substituting the bound on the influence of self-bounding functions into the above theorem.
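The polynomial regression used in this appendix operates on monomial features of the binary input. The sketch below shows only that feature expansion, enumerating every monomial of degree at most d; the regression step itself is omitted, and the function name and the ordering of features are hypothetical choices made for illustration.

```python
from itertools import combinations


def monomial_features(x, degree):
    """Evaluate all monomials prod_{j in S} x[j] with |S| <= degree.

    Features are ordered by monomial size, then lexicographically by index
    set. The empty monomial (constant 1) comes first.
    """
    features = []
    for size in range(degree + 1):
        for index_set in combinations(range(len(x)), size):
            value = 1
            for j in index_set:
                value *= x[j]
            features.append(value)
    return features


print(monomial_features((1, 0, 1), 2))
```

For n variables the expansion has one feature per subset of size at most d, so the number of features grows polynomially in n for a fixed degree, which is what keeps the regression tractable when the required degree is small.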
Appendix D Experiment Details and Additional Results
D.1 Implementation Details
For the error simulation experiment in Section 6.2, we use a small MLP model with 2 hidden layers as the utility learning model, where the numbers of neurons in the hidden layers are 20 and 10, respectively. For the data point and group removal experiments, we use MLP models with 3 hidden layers as the utility learning model. Each fully-connected layer has LeakyReLU as the activation function and is regularized by Dropout [25]. We use the Adam optimizer with learning rate , mini-batch size 32 to train all of the utility models mentioned above for up to 800 epochs.
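The forward pass of the small utility model described above can be sketched in plain Python as follows. Only the layer widths (20 and 10), the LeakyReLU activation, and the use of Dropout come from the text; the binary subset-mask input encoding, the LeakyReLU slope, the dropout rate, the weight initialization, and all function names are assumptions made for illustration, and the training loop (Adam, mini-batches, epochs) is omitted.

```python
import random


def leaky_relu(values, slope=0.01):
    # slope=0.01 is an assumed default, not a value stated in the paper
    return [v if v > 0 else slope * v for v in values]


def dropout(values, rate, rng, training):
    """Inverted dropout; the identity map in evaluation mode."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def linear(values, weights, bias):
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    # Assumed uniform initialization scaled by the fan-in.
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_utility_mlp(n_points, rng, hidden=(20, 10)):
    sizes = (n_points,) + tuple(hidden) + (1,)
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_utility(params, subset_mask, rng=None, drop_rate=0.0,
                    training=False):
    """Map a binary subset mask (assumed input encoding) to a scalar."""
    hidden = [float(bit) for bit in subset_mask]
    for weights, bias in params[:-1]:
        hidden = leaky_relu(linear(hidden, weights, bias))
        hidden = dropout(hidden, drop_rate, rng, training)
    weights, bias = params[-1]
    return linear(hidden, weights, bias)[0]


rng = random.Random(0)
params = init_utility_mlp(n_points=5, rng=rng)
print(predict_utility(params, (1, 0, 1, 1, 0)))
```

In evaluation mode the dropout layers are inactive, so repeated predictions for the same subset are deterministic.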
For fair comparisons, we always fix the same training budget for different baselines. For Group Testing, we use half of the training budget to estimate the Shapley value of the last data point and the other half to estimate the differences in Shapley value between data points. We use the CVXOPT library (https://cvxopt.org/) to solve the constrained minimization problem in the least core approximation. We set the degree of CGA to 2 in the experiment. We use the SGD optimizer with learning rate , batch size 32 to train the CGA model.
D.2 Additional Results
For the error simulation experiment, we show estimation errors in Figure 4. For these two less stringent error metrics, we can see that data utility learning still greatly reduces the Shapley/Least core estimation error with a relatively small number of sampled utilities.
We also show additional results for the group data removal task with training budget and in Figure 5. Similar to the case when , the Shapley/Least core estimations boosted by data utility learning still significantly outperform the vanilla ones for larger .