Learnability of Learning Performance and Its Application to Data Valuation

07/13/2021 · by Tianhao Wang et al. · Harvard University and Virginia Polytechnic Institute and State University

For most machine learning (ML) tasks, evaluating learning performance on a given dataset requires intensive computation. On the other hand, the ability to efficiently estimate learning performance may benefit a wide spectrum of applications, such as active learning, data quality management, and data valuation. Recent empirical studies show that for many common ML models, one can accurately learn a parametric model that predicts learning performance for any given input dataset from a small number of samples. However, the theoretical underpinning of the learnability of such performance prediction models is still missing. In this work, we develop the first theoretical analysis of the ML performance learning problem. We propose a relaxed notion of submodularity that can well describe the behavior of learning performance as a function of the input dataset. We give a learning algorithm that achieves a constant-factor approximation under certain assumptions. Further, we give a learning algorithm that achieves arbitrarily small error based on a newly derived structural result. We then discuss a natural, important use case of performance learning — data valuation — which is known to suffer from computational challenges due to the requirement of estimating learning performance for many data combinations. We show that performance learning can significantly improve the accuracy of data valuation.


1 Introduction

Given a learning algorithm, evaluating the learning performance for an arbitrary input dataset traditionally requires training on the dataset and then measuring the performance of the trained model. The computational complexity of this evaluation process is often dominated by the training part, which can be very expensive. On the other hand, a variety of applications could benefit from an efficient process for estimating learning performance. Examples include noisy or adversarial data removal, data summarization (Wang et al., 2021b), and active learning (Wang et al., 2021a), which can all follow the same recipe: evaluate the performance of many subsets of data points and then train only on the subset that achieves the highest learning performance.

Figure 1 (from Wang et al. (2021a)): An illustration of data utility functions for a logistic regression model trained on the MNIST dataset. Each of the three lines is generated by repeatedly sampling a random data point, adding it to the training set, and recording the logistic regression accuracy as the training set size grows.

One natural idea for estimating learning performance without training is to learn the mapping from a dataset to the resulting learning performance. We will refer to this mapping as the data utility function hereinafter. Each training sample for learning a data utility function consists of a set of data points and the corresponding learning performance. While a learned data utility function allows efficient learning performance prediction for any given input dataset, the training samples for learning it are expensive to construct, as each sample requires training on a dataset. Hence, there is a pressing need to understand the sample efficiency of this learning problem. Recent work shows promising empirical results that for many common learning algorithms, data utility functions can be accurately learned from a relatively small number of samples (Wang et al., 2021a). The authors conjecture that the observed sample efficiency might be caused by the approximate submodularity of data utility functions (see Figure 1); specifically, most of the time, an extra training data point contributes less to learning performance as the training set grows. There is a rich line of work on sample complexity bounds for submodular functions (Balcan and Harvey, 2011; Gupta et al., 2013; Feldman and Vondrák, 2016; Feldman et al., 2020). However, there is still no understanding of (1) how to rigorously characterize functions with "approximate" submodularity and (2) how approximate submodularity affects learnability.

In this paper, we propose a relaxed submodularity condition that well describes the data utility functions of common learning algorithms. We show that functions satisfying this "relaxed submodularity" condition are subsumed by a larger function class called (a, b)-self-bounding functions, which enjoys a dimension-free concentration bound (Boucheron et al., 2000). Leveraging this concentration bound, we show that these functions can be learned up to a constant-factor approximation error under the PMAC learning framework. Notably, our result applies to general self-bounding functions, which may be of independent interest. Furthermore, we show that (a, b)-self-bounding functions can be approximated by low-degree polynomials, and we leverage this structural result to show that they can be learned to arbitrarily small error under the PAC learning framework. Both the PMAC and PAC frameworks are widely used in analyzing the learnability of real-valued functions.

To showcase the usefulness of data utility function learning, we study its application to data valuation as a concrete example. The goal of data valuation is to quantify the usefulness of each data source for downstream analysis tasks. Data valuation has gained a lot of attention recently because it can be used to inform the implementation of recent policies aimed at giving individuals control over how their data is used and monetized by third parties (Voigt and Von dem Bussche, 2017). Moreover, the characterization of data value enables users to filter out poor-quality data and to identify data that is important to collect in the future (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020). Existing work leverages concepts from cooperative game theory, such as the Shapley value and the Least core, as fair notions of data value. However, computing or even approximating these data value notions requires evaluating the learning performance of many different combinations of data sources, which can be very computationally expensive. We propose to learn a data utility function and use it to predict the learning performance of a subset of data sources without retraining. We conduct extensive experiments and show that data utility learning can significantly improve the approximation accuracy of the Shapley value and the Least core. Moreover, our approach can be extended to many other cooperative games where the characteristic functions are close to submodular and expensive to evaluate, such as bidders' valuation functions in combinatorial auctions (Lehmann et al., 2006).

2 Related Work

Data utility learning was first proposed in Wang et al. (2021a), where the trained data utility models are used for active learning. Wang et al. (2021a) observe that most common data utility functions can be learned efficiently from a relatively small number of samples. They conjecture that the learnability of data utility functions is due to their "approximate" submodularity. We note that the term "approximate submodularity" has different definitions in the literature (Horel and Singer, 2016; Hassidim and Singer, 2018; Das and Kempe, 2018; Chierichetti et al., 2020). However, these definitions do not consider the special properties of data utility functions observed empirically; e.g., the deviation from submodularity may depend on the size difference between two datasets. More importantly, no existing work has studied the learnability of approximately submodular functions. This paper presents a rigorous characterization of, and learnability results for, the "relaxed" submodular functions that well describe data utility functions. Chierichetti et al. (2015) and Feige et al. (2020) present algorithms for learning the closest modular function via queries, which is orthogonal to our focus.

Self-bounding functions were first introduced by Boucheron et al. (2000). McDiarmid and Reed (2006) further refine them and introduce the more general notion of (a, b)-self-bounding functions. The most notable property of self-bounding functions is their dimension-free concentration bound, derived via the entropy method (Boucheron et al., 2000, 2003; McDiarmid and Reed, 2006; Boucheron et al., 2009). In particular, submodular functions are special instances of self-bounding functions. A similar concentration bound for submodular functions was independently derived in Balcan and Harvey (2011) using Talagrand's inequality, apparently unaware of the connection to self-bounding functions. Our paper enriches the learnability theory of (a, b)-self-bounding functions and explores their novel application to characterizing data utility functions.

Game-theoretic formulations of data valuation have become popular in recent years. In particular, the Shapley value has been widely used as a data value notion (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020), as it uniquely satisfies a set of desirable properties. Recently, Yan and Procaccia (2020) propose to use the Least core as an alternative to the Shapley value for data valuation. However, exact computation of the Shapley value and the Least core is NP-hard in most settings, which limits their applicability in real-world data valuation applications even at the scale of hundreds of data points. Several heuristics, such as TMC-Shapley (Ghorbani and Zou, 2019), G-Shapley (Ghorbani and Zou, 2019), and KNN-Shapley (Jia et al., 2019a), have been proposed to approximate the Shapley value. Despite their computational advantage, they are biased in nature, limiting their applicability in sensitive applications such as designing monetary rewards for data sharing or assigning responsibility for ML decisions. On the other hand, unbiased estimators of the Shapley value such as Permutation Sampling (Maleki, 2015) and Group Testing (Jia et al., 2019b) still require a large number of learning performance evaluations for any decent approximation accuracy.

The technique proposed in this paper is not another approximation heuristic for the Shapley value or the Least core. Rather, it is a natural way to improve the unbiased Shapley or Least core estimation, and it is compatible with any approximation heuristics that require a significant amount of data utility samples, such as permutation sampling (Maleki, 2015).

A similar idea of boosting Shapley value estimation by learning the characteristic function has been explored in cooperative game abstraction (CGA) (Yan et al., 2020). However, their technique may not be directly applicable to data valuation due to several issues. We provide a detailed comparison between CGA and our technique for data valuation in Section 5.

3 Technical Preliminaries

Since data utility functions are set functions, we start by reviewing important classes of set functions relevant to describing data utility functions.

Set functions defined on the power set of N = {1, …, n} can be equivalently viewed as functions defined on the binary hypercube {0, 1}^n.

Definition 1.

A set function f : 2^N → ℝ is

  • monotone, if f(S) ≤ f(T) for all S ⊆ T ⊆ N.

  • submodular, if f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all S, T ⊆ N.

Note that a submodular function is not necessarily monotone. An equivalent definition of submodularity is diminishing returns: for every S ⊆ T ⊆ N and every i ∈ N ∖ T, a submodular function satisfies

f(S ∪ {i}) − f(S) ≥ f(T ∪ {i}) − f(T).   (1)
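To make the definitions concrete, the following sketch exhaustively verifies the diminishing-returns inequality (1) for a small coverage function, a classic example of a monotone submodular set function (the particular sets below are a toy example of ours, not from the paper):

```python
from itertools import combinations

def coverage(cover_sets, S):
    """Utility of an index set S: number of ground elements covered.
    Coverage functions are a standard example of monotone submodular functions."""
    covered = set()
    for i in S:
        covered |= cover_sets[i]
    return len(covered)

def satisfies_diminishing_returns(f, n):
    """Exhaustively check (1): f(S∪{i}) − f(S) >= f(T∪{i}) − f(T)
    for all S ⊆ T ⊆ [n] and i ∉ T. Only feasible for tiny n."""
    universe = range(n)
    all_sets = [frozenset(c) for k in range(n + 1) for c in combinations(universe, k)]
    for S in all_sets:
        for T in all_sets:
            if not S <= T:
                continue
            for i in universe:
                if i in T:
                    continue
                if f(S | {i}) - f(S) < f(T | {i}) - f(T):
                    return False
    return True

cover_sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
f = lambda S: coverage(cover_sets, S)
print(satisfies_diminishing_returns(f, 3))  # True: coverage is submodular
```

By contrast, a function with increasing marginal gains, such as S ↦ |S|², fails the same check.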

Next, we introduce (a, b)-self-bounding functions. Let e_i denote the one-hot encoding of i ∈ [n], i.e., e_i is a vector of size n with zeros on all but the i-th dimension. Let ⊕ denote the XOR operation.

Definition 2 (McDiarmid and Reed (2006)).

For a function f : {0, 1}^n → ℝ and any i ∈ [n], let f_i(x) = min(f(x), f(x ⊕ e_i)). Then f is (a, b)-self-bounding, if for every x ∈ {0, 1}^n and i ∈ [n],

0 ≤ f(x) − f_i(x) ≤ 1   and   Σ_{i=1}^{n} (f(x) − f_i(x)) ≤ a·f(x) + b.

Self-bounding functions are the most general class of functions currently known to enjoy strong "dimension-free" (i.e., independent of the dimension n) concentration bounds.

There are two commonly used analysis models for learning real-valued functions. One is the generalization of Valiant's PAC learning model (Valiant, 1984) to real-valued functions. Specifically, for any unknown target function f and target error ε, with non-negligible probability a PAC learner should output a hypothesis h whose expected error with respect to f is at most ε. However, this does not differentiate between the case of low error on most of the distribution but high error on a few points, and the case of moderate error everywhere. A more demanding model is PMAC learning, introduced by Balcan and Harvey (2011), where a learner has to output a hypothesis that multiplicatively approximates the target function. Specifically, a PMAC learner with approximation factor α and error ε outputs a hypothesis h satisfying Pr[h(x) ≤ f(x) ≤ α·h(x)] ≥ 1 − ε; we say that h multiplicatively α-approximates f over the distribution in this case. While in general both models make no assumptions on the data distribution, our analysis focuses on the fixed, uniform distribution over {0, 1}^n. In practice, for a data utility learning task, one typically has control over the sampling distribution of the training samples and can therefore set it to be uniform.

4 Learning Data Utility Functions

4.1 Characterizing Data Utility Functions

A data utility function maps a set of data points to a real number indicating the utility of the dataset. In the ML context, the utility of a dataset is typically measured by the performance of the ML model trained on it, such as test accuracy. Prior work (Wang et al., 2021a) has empirically observed that many data utility functions can be efficiently learned from a relatively small number of samples. One conjectured cause of this efficient learnability is that data utility functions are close to submodular (e.g., Figure 1) for many common learning algorithms (Wang et al., 2021a).

                           Synthetic   USPS    IRIS       IRIS    MNIST      MNIST   CIFAR-10
                           +Logistic   +SVM    +Logistic  +SVM    +Logistic  +CNN    +CNN
exact submodularity (1)    68.15%      66.15%  94.1%      92.5%   50.95%     50.5%   50.8%
relaxed (2), smaller bias  99%         98.7%   99.85%     99.4%   66.4%      66.85%  79.6%
relaxed (2), larger bias   100%        100%    100%       100%    97.9%      95.8%   99.6%

Table 1: We randomly draw two data subsets S ⊆ T and a data point i ∉ T, and check how many sampled triples satisfy the "relaxed" submodularity condition in (2). We show the percentage of sampled triples that satisfy the constraint in (2) for given bias parameters, for all datasets and models whose data utility functions were claimed to be "approximately submodular" in previous literature (Wang et al., 2021a), under the exact same settings. The first row corresponds to the original definition of submodularity in (1).

We first check how well submodularity describes the properties of common data utility functions. Specifically, we uniformly sample some subsets from training data and evaluate their corresponding utility. We then calculate the percentage of these utility samples satisfying the submodularity condition in (1). The result is provided in the first row of Table 1. It shows that submodularity is an overly stringent condition to describe data utility functions because for many learning algorithms and datasets, the majority of utility samples do not satisfy it. Therefore, we propose a refined condition for modeling common data utility functions.
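The sampling check described above can be sketched as follows. Since each real utility sample would require training a model, we substitute a cheap synthetic stand-in whose saturating shape and deterministic "noise" are purely our own illustrative assumptions:

```python
import math
import random

random.seed(0)

def utility(S):
    """Stand-in for the test accuracy of a model trained on subset S:
    a noisy, saturating function of the subset size (purely illustrative)."""
    base = 1.0 - math.exp(-len(S) / 5.0)
    noise = 0.02 * math.sin(sum(S) + len(S))  # deterministic pseudo-noise
    return base + noise

def sampled_submodularity_rate(u, n, trials=2000):
    """Estimate the fraction of sampled triples (S, T, i) with S ⊆ T, i ∉ T
    that satisfy the diminishing-returns inequality (1)."""
    hits = 0
    for _ in range(trials):
        T = {j for j in range(n) if random.random() < 0.5}
        if len(T) == n:
            continue  # need a point outside T
        S = {j for j in T if random.random() < 0.5}
        i = random.choice([j for j in range(n) if j not in T])
        if u(S | {i}) - u(S) >= u(T | {i}) - u(T):
            hits += 1
    return hits / trials

rate = sampled_submodularity_rate(utility, 30)
print(f"fraction of triples satisfying (1): {rate:.2f}")
```

Even with a utility that is submodular "on average", noise makes a noticeable fraction of triples violate the exact inequality, mirroring the first row of Table 1.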

Definition 3 (relaxed submodularity).

We say that a set function f satisfies relaxed submodularity if for every S ⊆ T ⊆ N and every i ∈ N ∖ T, the diminishing-returns inequality in (1) holds up to a nonnegative additive bias term that grows with the size difference between T and S.

(2)

In particular, the bias term models the phenomenon that when datasets are small, the contribution of an additional data point has larger variance. Hence, when two datasets differ more in size, the marginal contributions to the two sets may deviate more from exact submodularity. When the bias term is zero, this condition reduces to the exact submodularity definition in (1). Table 1 shows that this relaxed condition better aligns with the behavior of data utility functions in practice. The following theorem shows that any set function with range [0, 1] that satisfies the relaxed submodularity condition is self-bounding.

Theorem 1.

Every set function f : {0, 1}^n → [0, 1] that satisfies the relaxed submodularity condition in (2) for all S ⊆ T ⊆ N and i ∈ N ∖ T is (a, b)-self-bounding, with constants a and b determined by the relaxation parameters.

Note that we can easily transform a data utility function to the range [0, 1] through normalization, as data utility functions typically have fixed ranges; e.g., classification accuracy in percentage always lies between 0 and 100. Hence, the learnability problem for data utility functions reduces to that for (a, b)-self-bounding functions.

4.2 Learning Self-bounding Functions

Existing works on learning self-bounding functions either make the extra assumption of monotonicity (Feldman et al., 2020) or focus exclusively on the special case of (a, 0)-self-bounding functions (Feldman and Vondrák, 2016). As shown in Figure 1, data utility functions are in general not monotone. Moreover, as shown in Table 1, they are better characterized by self-bounding functions with a non-zero offset b. Hence, the learnability results in prior works cannot be directly applied to data utility functions. In this work, we extend these results to the more general case of (a, b)-self-bounding functions and relax the monotonicity constraint.

Our first result shows that under the PMAC framework, self-bounding functions can be learned with constant-factor errors. Formally, we show the following result.

Theorem 2.

There exists an algorithm A that, given access to uniformly random examples of an (a, b)-self-bounding function f, with probability at least 1 − δ outputs a function f̃ which is a constant-factor multiplicative approximation of f over the uniform distribution. Further, A runs in constant time and uses a constant number of examples.

The proof intuition, inspired by Balcan and Harvey (2011), is as follows. Since the value of any self-bounding function is tightly concentrated around its expectation under the uniform distribution, the constant function equal to the empirical mean gives a good approximation to f. This result indicates that with access to random samples of a self-bounding function, it is possible to learn it up to constant multiplicative error with constant sample complexity and runtime. Note that this result exhibits a tradeoff between the approximation factor and the error term, as a constant function cannot approximate every self-bounding function arbitrarily well.
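The constant-hypothesis learner behind this argument is almost trivial to implement; the highly concentrated toy utility below is our own illustration, not a function from the paper:

```python
import random

random.seed(1)

def pmac_constant_learner(utility, n, num_samples=200):
    """Sketch of the learner behind Theorem 2: a self-bounding function
    concentrates around its mean under the uniform distribution, so the
    constant hypothesis equal to the empirical mean already multiplicatively
    approximates the function on most inputs."""
    total = 0.0
    for _ in range(num_samples):
        S = {i for i in range(n) if random.random() < 0.5}  # uniform random subset
        total += utility(S)
    mean = total / num_samples
    return lambda S: mean  # constant hypothesis

# Illustrative, highly concentrated utility: normalized subset size.
n = 200
u = lambda S: len(S) / n
h = pmac_constant_learner(u, n)

# Fraction of fresh random subsets approximated within a multiplicative factor of 2.
fresh = [{i for i in range(n) if random.random() < 0.5} for _ in range(1000)]
good = sum(1 for S in fresh if h(S) / 2 <= u(S) <= 2 * h(S))
print(f"within factor 2 on {good / 1000:.1%} of subsets")
```

The multiplicative guarantee holds on almost every subset here precisely because u is concentrated; a less concentrated function would expose the tradeoff mentioned above.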

Our second learnability result is derived under the less demanding PAC learning framework. We first extend the structural results of Feldman et al. (2020) for (a, 0)-self-bounding functions to the more general setting of (a, b)-self-bounding functions. Specifically, we find that any (a, b)-self-bounding function can be approximated by low-degree polynomials in ℓ2-norm.

Theorem 3.

Let f be an (a, b)-self-bounding function and ε > 0. Then there exists a set of indices I of bounded size and a polynomial p of degree d over the variables in I such that ||f − p||_2 ≤ ε.

Based on this structural result, we give a learnability result for (a, b)-self-bounding functions in a distribution-specific PAC learning model.

Theorem 4.

Let F be the class of all (a, b)-self-bounding functions from {0, 1}^n to [0, 1], and let ε, δ > 0. There exists an algorithm A that, given ε, δ, and access to uniformly random examples of any target function f ∈ F, with probability at least 1 − δ outputs a hypothesis h such that E[(h(x) − f(x))^2] ≤ ε. Further, A runs in time polynomial in the number of degree-d monomials and uses a number of examples that scales logarithmically with n.

The learning algorithm used to prove this theorem is based on polynomial regression over all monomials of degree at most d, inspired by Feldman and Vondrák (2016) and Feldman et al. (2020). Unlike in Theorem 2, the approximation error can be made arbitrarily small. Moreover, the sample complexity scales logarithmically with the input dimension n, which can still be quite efficient.
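A minimal version of this polynomial-regression learner can be sketched as follows; the degree-2 target function and the specific sample sizes are our own illustrative choices, not taken from the paper:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

def monomial_features(x, d):
    """All monomials of degree <= d of the binary variables in x."""
    feats = [1.0]
    for k in range(1, d + 1):
        for idx in combinations(range(len(x)), k):
            feats.append(np.prod(x[list(idx)]))
    return np.array(feats)

def poly_regression_learner(utility, n, d=2, num_samples=300):
    """Sketch of the Theorem 4 learner: least-squares regression over all
    monomials of degree <= d, justified by the low-degree approximation of
    self-bounding functions (Theorem 3)."""
    X, y = [], []
    for _ in range(num_samples):
        x = rng.integers(0, 2, size=n).astype(float)  # uniform point of {0,1}^n
        X.append(monomial_features(x, d))
        y.append(utility(x))
    coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return lambda x: monomial_features(x, d) @ coef

# Example target: a function that is itself a degree-2 polynomial is recovered.
n = 8
target = lambda x: 0.3 * x[0] + 0.2 * x[1] * x[2] + 0.1
h = poly_regression_learner(target, n)
x = rng.integers(0, 2, size=n).astype(float)
print(abs(h(x) - target(x)))  # near zero: the target lies in the hypothesis class
```

For targets that are only approximately low-degree, the residual of the regression reflects the ε in Theorem 3 rather than vanishing.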

The practical implication of these learnability results is that data utility functions can potentially be learned efficiently from a limited number of random utility samples.
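As an illustration of what such learning looks like in code, the sketch below fits a small MLP (the paper's experiments also use an MLP utility model, though the exact architecture and hyperparameters here are our own choices) to samples of a synthetic, saturating utility function standing in for real training runs:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a data utility function: a saturating function of the
# subset size, mimicking accuracy-vs-training-set-size curves (illustrative only).
n = 30

def utility(x):
    return 1.0 - np.exp(-x.sum() / 10.0)

# Collect (subset indicator, utility) samples under the uniform distribution.
X = rng.integers(0, 2, size=(500, n)).astype(float)
y = np.array([utility(x) for x in X])

# A small MLP as the data utility model.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
model.fit(X, y)

# Held-out evaluation: predicted utilities for subsets never sampled.
X_test = rng.integers(0, 2, size=(200, n)).astype(float)
y_test = np.array([utility(x) for x in X_test])
mae = np.mean(np.abs(model.predict(X_test) - y_test))
print(f"mean absolute error on held-out subsets: {mae:.3f}")
```

Each row of X costs one model training in a real pipeline, which is exactly why the sample efficiency analyzed above matters.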

5 Applying Data Utility Learning to Data Valuation

As a concrete example to demonstrate the usefulness of data utility learning, we explore a novel application in data valuation.

The Shapley value is a widely used data value notion nowadays (Ghorbani and Zou, 2019; Jia et al., 2019b, a, c; Wang et al., 2020). Recent work has also proposed to use the Least core (Yan and Procaccia, 2020), another famous solution concept in cooperative game theory, as a viable alternative to the Shapley value. (The Least core may not be unique; whenever we refer to the Least core, we mean the least core vector with the smallest norm, following the tie-breaking rule in the original literature (Yan and Procaccia, 2020).) Both data value notions have rigorous fairness guarantees, making them particularly attractive in sensitive applications that assign monetary rewards or attribute responsibility based on data value. Given n data points N = {1, …, n} and a data utility function U, the Shapley value of data point i is

φ_i(U) = (1/n) Σ_{S ⊆ N∖{i}} (n−1 choose |S|)^{−1} [U(S ∪ {i}) − U(S)],

and the Least core is the payoff vector x minimizing e subject to Σ_{i∈N} x_i = U(N) and Σ_{i∈S} x_i ≥ U(S) − e for all S ⊆ N.

It can be seen from these definitions that exact calculation of both the Shapley value and the Least core requires evaluating the data utility function on every possible subset of the training data, i.e., exponentially many model trainings. Existing works on data valuation have largely focused on making this calculation more efficient. The main idea underlying existing works is to evaluate the data utility only on some sampled subsets and then estimate the data values from these utility samples. One key aspect ignored by all prior works is that, with the utility samples, we can learn a model of the data utility function, which can in turn be used to predict the utility for subsets that were not sampled.
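To make the exponential cost concrete, here is a minimal exact Shapley computation by direct enumeration; the additive toy game is our own illustration, chosen because its Shapley values are known in closed form (each player's own weight):

```python
from itertools import combinations
from math import comb

def exact_shapley(utility, n):
    """Exact Shapley values by enumerating all subsets: O(2^n) utility
    evaluations, feasible only for tiny n — which is what motivates the
    sampling heuristics and data utility learning."""
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [j for j in players if j != i]
        for k in range(n):
            for S in combinations(others, k):
                S = set(S)
                weight = 1.0 / (n * comb(n - 1, k))
                phi[i] += weight * (utility(S | {i}) - utility(S))
    return phi

# Additive game: the Shapley value of each player equals its own weight.
w = [0.1, 0.4, 0.5]
u = lambda S: sum(w[i] for i in S)
print(exact_shapley(u, 3))  # ≈ [0.1, 0.4, 0.5]
```

Replacing `utility` with a real training-and-evaluation routine makes each of the 2^n calls a full model training, which is the bottleneck our approach targets.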

This paper investigates the potential of using data utility learning to further improve the efficiency of various existing data value approximation heuristics. We first build an abstraction of data value approximation heuristics that incorporates all the existing unbiased heuristics (Maleki, 2015; Jia et al., 2019b; Yan and Procaccia, 2020) as well as some of the biased heuristics involving data utility sampling (Ghorbani and Zou, 2019). These heuristics consist of two components: a sampler and an estimator. The heuristic sampler takes a dataset and a sampling budget as input and outputs a set of utility samples, each consisting of a data subset and its utility. The heuristic estimator then takes the sampled utility set and computes the estimate of the corresponding solution concept (i.e., the Shapley value or the Least core). Algorithm 1 summarizes our algorithm for accelerating data valuation with data utility learning. Our algorithm leverages the utility sample set already available in existing heuristics and uses it for data utility learning. Once we obtain the learned data utility model, we use it to predict the utilities of additional data subsets. Finally, we feed the combination of the true utility samples and the predicted utility samples to the heuristic estimator. Since querying the utility model is very efficient, the additional predicted utility samples can be acquired almost for free! In practice, the evaluation budget can be set much larger than the training budget in Algorithm 1.

input : Data valuation heuristic (sampler and estimator), dataset, training budget, evaluation budget, training algorithm for the data utility model.
output : Value allocation vector.
1. Sample utilities of data subsets with the heuristic sampler, up to the training budget.
2. Train the data utility model on the collected (subset, utility) pairs.
3. Predict utilities of additional data subsets with the trained model, up to the evaluation budget.
4. Compute the value allocation vector by running the heuristic estimator on the combined true and predicted utility samples.
5. Return the value allocation vector.
Algorithm 1: Accelerating the Estimation of Data Valuation Solution Concepts
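Algorithm 1 can be sketched on top of permutation sampling (Maleki, 2015) as follows. True utilities are collected up to a training budget, a utility model is fit on them, and extra permutations are processed with the cheap model. The size-based lookup model and the additive toy utility are deliberately crude stand-ins of our own, not the MLP or datasets used in the paper:

```python
import random

random.seed(2)

def fit_by_size(samples):
    """Toy data utility model: predict the mean utility of training subsets of
    the same (or nearest) size. A stand-in for a real learned model."""
    by_size = {}
    for S, val in samples:
        by_size.setdefault(len(S), []).append(val)
    means = {k: sum(v) / len(v) for k, v in by_size.items()}
    def model(S):
        k = len(S)
        if k in means:
            return means[k]
        return means[min(means, key=lambda m: abs(m - k))]
    return model

def permutation_shapley_with_dul(utility, n, train_budget, eval_budget, fit):
    """Sketch of Algorithm 1: spend `train_budget` real utility evaluations,
    fit a utility model, then add `eval_budget` nearly free predicted ones."""
    samples, contribs = [], [[] for _ in range(n)]

    def run(num_evals, u_fn, record):
        done = 0
        while done < num_evals:
            perm = random.sample(range(n), n)
            S, prev = frozenset(), u_fn(frozenset())
            if record:
                samples.append((S, prev))
            for i in perm:
                S = S | {i}
                cur = u_fn(S)
                contribs[i].append(cur - prev)  # marginal contribution of i
                prev = cur
                done += 1
                if record:
                    samples.append((S, cur))
                if done >= num_evals:
                    break

    run(train_budget, utility, record=True)   # expensive: real model trainings
    model = fit(samples)                      # data utility learning step
    run(eval_budget, model, record=False)     # cheap: predicted utilities
    return [sum(c) / len(c) for c in contribs]

# Toy additive utility: every point contributes 0.1, so all Shapley values are 0.1.
values = permutation_shapley_with_dul(lambda S: len(S) / 10, n=10,
                                      train_budget=100, eval_budget=300,
                                      fit=fit_by_size)
print([round(v, 3) for v in values])
```

In a real deployment, `fit` would train the MLP utility model and the evaluation budget would be set far larger than the training budget, since model queries are nearly free.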

Comparison with Cooperative Game Abstraction.

Yan et al. (2020) propose a similar idea that approximates the Shapley value by learning the characteristic function of a cooperative game with a parametric model called cooperative game abstraction (CGA). CGA is essentially a linear regression model whose variables are small subsets of players; the order of a CGA is the largest size of player groups included as variables. One advantage of CGA is that it can recover the Shapley value directly from its parameters if the CGA perfectly matches the characteristic function. Hence, Yan et al. (2020) propose to learn the characteristic function using CGA from a certain amount of samples, and then compute the Shapley value from the trained parameters. For the data valuation problem, the characteristic function is the data utility function, so Yan et al. (2020)'s technique can be viewed as a special case of data utility learning where the data utility model is a linear regression. However, we argue that CGA may not be a suitable model for data utility learning, for two main reasons: (i) CGA is only suitable for games where interactions exist only among small groups of players, e.g., team performance in basketball games. On the contrary, interactions among large groups of data points exist and may be strong, so CGA is inherently ill-suited to data utility learning; we confirm this point in Section 6. (ii) CGA scales poorly even at low order; e.g., a third-order CGA with 2000 players has more than 10^9 parameters, more than even a very large fully-connected neural network. We also note that the parameters of CGA admit a closed form for computing the Shapley value, but not the Least core.
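The parameter-count comparison can be made concrete with a short calculation; the accounting below (one regression coefficient per nonempty player group up to the order, plus a bias) is our own reading of the CGA parameterization and may differ in small details:

```python
from math import comb

def cga_num_params(n_players, order):
    """Parameter count of an order-k CGA under the natural parameterization:
    one coefficient per nonempty player group of size <= k, plus a bias term.
    (Our own accounting; the exact bookkeeping in CGA may differ slightly.)"""
    return 1 + sum(comb(n_players, k) for k in range(1, order + 1))

print(f"order-3 CGA, 2000 players: {cga_num_params(2000, 3):,} parameters")
print(f"order-2 CGA, 2000 players: {cga_num_params(2000, 2):,} parameters")
```

The cubic blow-up of the order-3 count (dominated by the C(2000, 3) third-order terms) is what makes CGA impractical at this scale.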

6 Evaluation

6.1 Evaluation Settings

Protocol.

We evaluate the effectiveness of data utility learning for boosting the performance of Shapley and Least core estimation heuristics. We first assess Shapley value and Least core estimators on datasets small enough to directly calculate the true data values, and evaluate the estimation error of different heuristics. For larger datasets, since it is impractical to compute the exact data values, we compare data value estimates on the data removal task, following Ghorbani and Zou (2019); Jia et al. (2019a, c); Wang et al. (2020); Yan and Procaccia (2020). In addition, we evaluate performance on data group valuation, where Shapley or Least core values are assigned to groups of data points rather than single points. For all experiments, we use a three-layer MLP as the data utility model.

Baselines.

For Shapley value estimation, we consider the following baselines: (1) Permutation Sampling (Perm) (Maleki, 2015), a Monte Carlo algorithm; (2) Group Testing (GT) (Jia et al., 2019b), an improved Monte Carlo algorithm based on group testing theory; and (3) CGA (Yan et al., 2020), which improves the efficiency of Shapley value estimation by using a linear combination of the utilities on small subsets to approximate the utility function. For Least core estimation, the baseline is the Monte Carlo (MC) approach (Yan and Procaccia, 2020), which is the only known unbiased estimator.

We defer the implementation details of data utility learning and the baseline approaches to the supplementary materials. For every experiment, we repeat each heuristic computation 10 times to obtain error bars.

6.2 Results

Error Simulation. We first test the performance of different Shapley value and Least core estimation heuristics on tiny datasets with fewer than 15 training data points. In this case, it is computationally feasible to compute the exact Shapley and Least core values, so we can calculate the estimation error of each heuristic. We experiment on both a synthetic and a natural dataset. To generate the synthetic dataset, we sample 10 data points from a 2-dimensional Gaussian distribution and assign labels based on the sign of the sum of the features. A logistic regression classifier trained on the 10 data points achieves around 80% test accuracy. For the natural dataset, we randomly sample 15 data points from the well-known Iris dataset (Pedregosa et al., 2011). A support-vector machine (SVM) classifier trained on the 15 data points achieves around 94% test accuracy. After training the utility model, we obtain estimated data utilities for all subsets not sampled before. We then estimate the Shapley and Least core values by plugging the estimated utilities into the exact Shapley and Least core calculation formulas.

Figure 2: Approximation error for Shapley and Least core values as the number of samples varies. (a), (c): synthetic dataset with 10 data points; (b), (d): Iris dataset with 15 data points. DUL stands for data utility learning.

We show the estimation errors in Figure 2 and defer the results for the other error metrics to the supplementary materials. As we can see, with a relatively small number of sampled utilities (e.g., 200–300 for the synthetic dataset and 500 for Iris), data utility learning significantly reduces the estimation errors for both the Shapley and Least core values. Utility prediction per se introduces additional computational cost, yet this cost is often negligible compared to model retraining. Moreover, CGA-based Shapley estimation performs poorly, because data utility functions contain high-order data interactions that CGA cannot capture.

Data Removal. We evaluate the Shapley/Least core value estimates on larger datasets by comparing their performance on the data removal task. Specifically, we remove the most (least) valuable portion of the dataset and observe how the utility of the remaining dataset changes. Intuitively, a better data value estimate better identifies the importance of data points: when the data with the highest (lowest) value estimates are removed, a better estimation method leads to a faster (slower) performance drop. As before, we experiment on both synthetic and real-world datasets. For synthetic data generation, we sample 200 data points from a 50-dimensional Gaussian distribution whose parameters are sampled uniformly, and each data point is labeled by the sign of the sum of its features. The utility of a dataset is the test accuracy of a logistic regression classifier trained on it. For the real-world experiment, we select 2000 data points from the PubFig83 dataset (Pinto et al., 2011), and the utility is the top-5 accuracy of a convolutional neural network trained on it (for facial recognition). Since the error simulation experiment showed that CGA performs poorly at estimating Shapley values for data valuation, and since CGA does not scale to larger datasets as discussed in Section 5, we do not include it as a baseline for this experiment.

We experiment with different training budgets, and set the evaluation budget much larger, as utility model evaluation is much faster than retraining the model; we show the results for a fixed training budget for each dataset. This is clearly a low-resource setting, as computing the exact Shapley or Least core values requires an exponential number of model trainings. As can be seen in Figure 3 (a)-(b) and (d)-(e), the estimation heuristics equipped with data utility learning consistently perform better at identifying the most and least valuable data points. This means that the Shapley and Least core values estimated with predicted utilities are more effective at identifying the most and least valuable data points in these settings. As a side note, the Shapley value estimated by permutation sampling is superior to the Least core estimated by the Monte Carlo algorithm, which does not agree with the experimental results in Yan and Procaccia (2020). An interesting future direction is thus to better understand when the Shapley value performs better than the Least core and vice versa.
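The data-removal protocol itself is simple to state in code; the additive toy utility and weights below are hypothetical, chosen so the resulting curves are easy to verify by hand:

```python
def removal_curve(values, utility, n, fractions=(0.1, 0.2, 0.3), remove_best=True):
    """Data-removal evaluation: drop the highest- (or lowest-) valued points and
    record the utility of the remainder. A better valuation yields a steeper
    drop when removing the best points and a slower drop when removing the worst."""
    order = sorted(range(n), key=lambda i: values[i], reverse=remove_best)
    curve = []
    for frac in fractions:
        k = int(frac * n)
        remaining = frozenset(range(n)) - frozenset(order[:k])
        curve.append(utility(remaining))
    return curve

# Toy additive utility where each point's true value is its weight.
w = [0.05, 0.1, 0.2, 0.25, 0.4]
u = lambda S: sum(w[i] for i in S)
print(removal_curve(w, u, 5, fractions=(0.2, 0.4)))                    # fast drop
print(removal_curve(w, u, 5, fractions=(0.2, 0.4), remove_best=False)) # slow drop
```

In the experiments, `utility` is a full retrain-and-evaluate run and `values` come from the compared Shapley/Least core estimators.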

Figure 3: Curves of model test performance when the best/worst data points (groups) ranked according to Shapley (SV) or Least core (LC) estimations are removed. (a), (d) for Synthetic dataset, (b), (e) for PubFig83 dataset, (c), (f) for Adult data groups. The upper row ((a)-(c)) removes the best data points (groups). The steeper the drop, the better. The bottom row ((d)-(f)) removes the worst data points (groups). The sharper the rise, the better. DUL stands for data utility learning.

Group Data Removal. We also experiment with estimating Shapley and Least core values for groups of data points. This is a potentially more realistic and useful setting, since in practice more than one data record will be collected from each party. We divide the Adult dataset (Dua and Graff, 2017) into 200 groups of varying sizes. The proportion of data points allocated to each group is sampled from a Dirichlet distribution whose parameter is 30 in all dimensions. This design ensures a moderate amount of variation in group sizes. A utility sample in this setting refers to the performance of a logistic regression model trained on the data points provided by a coalition of groups. This is an inherently more challenging setting for data utility learning, since the “diminishing returns” property of the data utility function may be violated by the variation in group sizes.
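The Dirichlet-based group partition described above can be sketched as follows. The function name and the multinomial rounding step are our own choices; the paper only specifies that group-size proportions follow a symmetric Dirichlet with parameter 30.

```python
import numpy as np

def partition_into_groups(n_points, n_groups, alpha=30.0, seed=0):
    """Split point indices into groups whose size proportions follow Dirichlet(alpha, ..., alpha)."""
    rng = np.random.default_rng(seed)
    props = rng.dirichlet(np.full(n_groups, alpha))
    # Multinomial rounding guarantees the group sizes sum exactly to n_points.
    sizes = rng.multinomial(n_points, props)
    idx = rng.permutation(n_points)
    return np.split(idx, np.cumsum(sizes)[:-1])

groups = partition_into_groups(n_points=2000, n_groups=200)
```

A larger `alpha` concentrates the proportions near uniform; 30 yields moderately varied group sizes, as intended in the experiment.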

Since the Adult dataset is highly unbalanced, we use the F1-score as the utility metric. We show the results for here and defer the other settings to the supplementary materials. As we can see from Figure 3 (c) and (f), the heuristics equipped with data utility learning again improve both the Shapley and Least core estimations.

7 Conclusion

This paper presents the first learnability analysis of data utility functions. We propose a relaxed submodularity notion that well describes the properties of data utility functions for popular ML models. We show that these relaxed submodular functions are self-bounding, and then characterize bounds on the learning error under both the PMAC and PAC learning models. Finally, we study the application of data utility learning to data valuation and propose a generic framework that can significantly improve the accuracy of all existing unbiased Shapley/Least core estimation methods.

Limitations & Future Work. It is worth noting that many settings studied in our experiments in Section 6 go well beyond the scope of our theoretical learnability results in Section 4.2. Specifically, the theoretical results are derived for simple learning algorithms such as polynomial regression and focus on a uniform utility sample distribution. Meanwhile, our empirical results suggest that learning data utility functions can be data-efficient with neural network-based models, non-uniform utility sample distributions, and even functions that are not close to being submodular. Closing these gaps between theory and practice is important future work. With this paper, we move one step closer to a rigorous understanding of data utility learning and hope to inspire more research in this direction in the ML community.

References

  • [1] M. Balcan and N. J. Harvey (2011) Learning submodular functions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 793–802. Cited by: §1, §2, §3, §4.2.
  • [2] S. Boucheron, G. Lugosi, P. Massart, et al. (2003) Concentration inequalities using the entropy method. Annals of probability 31 (3), pp. 1583–1614. Cited by: §2.
  • [3] S. Boucheron, G. Lugosi, P. Massart, et al. (2009) On concentration of self-bounding functions. Electronic Journal of Probability 14, pp. 1884–1899. Cited by: §2, Theorem 6.
  • [4] S. Boucheron, G. Lugosi, and P. Massart (2000) A sharp concentration inequality with applications. Random Structures & Algorithms 16 (3), pp. 277–292. Cited by: §1, §2.
  • [5] F. Chierichetti, A. Das, A. Dasgupta, and R. Kumar (2015) Approximate modularity. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1143–1162. Cited by: §2.
  • [6] F. Chierichetti, A. Dasgupta, and R. Kumar (2020-10) On Additive Approximate Submodularity. arXiv e-prints, pp. arXiv:2010.02912. External Links: 2010.02912 Cited by: §2.
  • [7] A. Das and D. Kempe (2018) Approximate submodularity and its applications: subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research 19 (1), pp. 74–107. Cited by: §2.
  • [8] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.2.
  • [9] U. Feige, M. Feldman, and I. Talgam-Cohen (2020) Approximate modularity revisited. SIAM Journal on Computing 49 (1), pp. 67–97. Cited by: §2.
  • [10] V. Feldman, P. Kothari, and J. Vondrák (2020) Tight bounds on l1 approximation and learning of self-bounding functions. Theoretical Computer Science 808, pp. 86–98. Cited by: Appendix C, Appendix C, Appendix C, §1, §4.2, §4.2, §4.2, Definition 4, Theorem 8, Theorem 9.
  • [11] V. Feldman and J. Vondrák (2016) Optimal bounds on approximation of submodular and xos functions by juntas. SIAM Journal on Computing 45 (3), pp. 1129–1170. Cited by: Appendix C, Appendix C, Appendix C, §1, §4.2, §4.2, Definition 4.
  • [12] E. Friedgut (1998) Boolean functions with low average sensitivity depend on few coordinates. Combinatorica 18 (1), pp. 27–35. Cited by: Appendix C.
  • [13] A. Ghorbani and J. Zou (2019) Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §1, §2, §5, §5, §6.1.
  • [14] A. Gupta, M. Hardt, A. Roth, and J. Ullman (2013) Privately releasing conjunctions and the statistical query barrier. SIAM Journal on Computing 42 (4), pp. 1494–1520. Cited by: §1.
  • [15] A. Hassidim and Y. Singer (2018) Optimization for approximate submodularity. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 394–405. Cited by: §2.
  • [16] T. Horel and Y. Singer (2016) Maximization of approximately submodular functions.. In NIPS, Vol. 16, pp. 3045–3053. Cited by: §2.
  • [17] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gurel, B. Li, C. Zhang, C. J. Spanos, and D. Song (2019) Efficient task-specific data valuation for nearest neighbor algorithms. arXiv preprint arXiv:1908.08619. Cited by: §1, §2, §5, §6.1.
  • [18] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gürel, B. Li, C. Zhang, D. Song, and C. J. Spanos (2019) Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. Cited by: §1, §2, §5, §5, §6.1.
  • [19] R. Jia, F. Wu, X. Sun, J. Xu, D. Dao, B. Kailkhura, C. Zhang, B. Li, and D. Song (2019) Scalability vs. utility: do we have to sacrifice one for the other in data importance quantification?. arXiv preprint arXiv:1911.07128. Cited by: §1, §2, §5, §6.1.
  • [20] B. Lehmann, D. Lehmann, and N. Nisan (2006) Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior 55 (2), pp. 270–296. Cited by: §1.
  • [21] S. Maleki (2015) Addressing the computational issues of the shapley value with applications in the smart grid. Ph.D. Thesis, University of Southampton. Cited by: §2, §2, §5, §6.1.
  • [22] C. McDiarmid and B. Reed (2006) Concentration for self-bounding functions and an inequality of talagrand. Random Structures & Algorithms 29 (4), pp. 549–557. Cited by: §2, Definition 2.
  • [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. the Journal of machine Learning research 12, pp. 2825–2830. Cited by: §6.2.
  • [24] N. Pinto, Z. Stone, T. Zickler, and D. Cox (2011) Scaling up biologically-inspired computer vision: a case study in unconstrained face recognition on facebook. In CVPR 2011 WORKSHOPS, pp. 35–42. Cited by: §6.2.
  • [25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §D.1.
  • [26] L. G. Valiant (1984) A theory of the learnable. Communications of the ACM 27 (11), pp. 1134–1142. Cited by: §3.
  • [27] P. Voigt and A. Von dem Bussche (2017) The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing 10, pp. 3152676. Cited by: §1.
  • [28] T. Wang, S. Chen, and R. Jia (2021) One-round active learning. arXiv preprint arXiv:2104.11843. Cited by: Figure 1, §1, §1, §2, §4.1, Table 1.
  • [29] T. Wang, J. Rausch, C. Zhang, R. Jia, and D. Song (2020) A principled approach to data valuation for federated learning. In Federated Learning, pp. 153–167. Cited by: §1, §2, §5, §6.1.
  • [30] T. Wang, Y. Zeng, M. Jin, and R. Jia (2021) A unified framework for task-driven data quality management. arXiv preprint arXiv:2106.05484. Cited by: §1.
  • [31] T. Yan, C. Kroer, and A. Peysakhovich (2020) Evaluating and rewarding teamwork using cooperative game abstractions. arXiv preprint arXiv:2006.09538. Cited by: §2, §5, §6.1.
  • [32] T. Yan and A. D. Procaccia (2020) If you like shapley then you’ll love the core. Manuscript. Cited by: §2, §5, §5, §6.1, §6.1, §6.2, footnote 1.

Appendix A Proof for Theorem 1

Theorem 1.

Every function that satisfies

for all and is -self-bounding.

Proof.

Since the range of the function lies in , the requirement that is trivially satisfied. Our goal is to show that, for every and every , we have

(3)

Given an input , denote by the vector obtained by setting the th coordinate to the binary bit , and denote by the vector obtained by setting all coordinates in to . Let be the set of indices where , and let be the set of indices where .

WLOG, let . By the relaxed submodularity condition, we have

The last inequality is due to .

Similarly, letting where , we have

Therefore we have

Appendix B Proof of Theorem 2

In order to prove Theorem 2, we will need the following results.

Theorem 5 (Hoeffding Bound).

Let $X_1, \dots, X_m$ be independent random variables. Assume that $a_i \le X_i \le b_i$ for $i = 1, \dots, m$. Then, for the empirical mean $\bar{X} = \frac{1}{m} \sum_{i=1}^m X_i$, we have the inequalities

$$\Pr\left[ \bar{X} - \mathbb{E}[\bar{X}] \ge t \right] \le \exp\left( -\frac{2 m^2 t^2}{\sum_{i=1}^m (b_i - a_i)^2} \right), \qquad \Pr\left[ \mathbb{E}[\bar{X}] - \bar{X} \ge t \right] \le \exp\left( -\frac{2 m^2 t^2}{\sum_{i=1}^m (b_i - a_i)^2} \right).$$

Theorem 6 (Concentration Bound for Self-bounding Functions [3]).

If where are independent random variables and is -self-bounding, and , then

  • for any ;

  • for any .

input : , self-bounding parameters .
output : 
Let . Let .
Case 1: if , return the constant function .
Case 2: if , return the constant function .
Algorithm 2 Algorithm for PMAC learning
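The thresholds and constants in Algorithm 2 were lost in extraction, so the sketch below only conveys its shape: estimate the mean utility from uniform random subsets and return one of two constant hypotheses depending on whether the mean is small. The `threshold` value here is a placeholder for the cutoff (a function of the self-bounding parameters and the accuracy parameter), not the paper's actual constant.

```python
import random

def pmac_constant_learner(f, n, m=1000, threshold=0.05, seed=0):
    """Sketch of Algorithm 2: estimate E[f] over uniform subsets, return a constant hypothesis.

    `threshold` is a placeholder for the cutoff elided in the extracted text.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(m):
        S = frozenset(i for i in range(n) if rng.random() < 0.5)  # uniform random subset
        samples.append(f(S))
    mu = sum(samples) / m
    if mu <= threshold:           # Case 1: the mean is small -> predict a small constant
        return lambda S: threshold
    return lambda S: mu           # Case 2: otherwise predict the empirical mean

h = pmac_constant_learner(lambda S: len(S) / 10, n=10, m=2000)
```

The multiplicative approximation guarantee then follows from concentration of self-bounding functions around their mean, as in the proof of Theorem 2.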

Now we prove Theorem 2.

Theorem 2.

There exists an algorithm that, given access to random and uniform examples of an -self-bounding function where , with probability at least outputs a function which is a multiplicative -approximation of over the uniform distribution, where . Further, runs in time and uses examples.

Proof.

We show that Algorithm 2 outputs a function which achieves the desired multiplicative approximation to any -self-bounding function with probability at least .

By Theorem 5, with probability at least we have

If , then we have

(4)

Let . By Theorem 6, we have

for a randomly drawn over uniform distribution.

Case 1: suppose the empirical mean . Then , which leads to . Therefore

(5)

where (5) is due to . Therefore, with probability at least , Algorithm 2 achieves an approximation factor of on all but a fraction of the distribution.

Case 2: suppose the empirical mean . In this case, we know that . Since for all , we only need to bound the probability that a sample drawn from the uniform distribution violates , which is calculated as

This concludes the proof of Theorem 2. ∎

Appendix C Proof of Theorem 3 and 4

Our structural result in Theorem 3 builds on the work of [11] and [10]. [10] shows that every real-valued function with low total influence can be approximated by a low-degree polynomial. [11] and [10] also extend the result of [12] on the approximability of real-valued functions by functions of a small number of variables (referred to as juntas in the original work), where the bound on the number of variables also depends on the influence of the target function. Thus, the key idea in proving Theorem 3 is to show that every -self-bounding function has low total influence.

We first formally define the influence of a function defined on . Given an input , denote by the vector obtained by setting the th coordinate to the binary bit . For a function and index , we define . The -norm of is defined by , where is the uniform distribution.

Definition 4 (-Influences [11, 10]).

For a function , , we define the -influence of variable as . We define and refer to it as the total -influence of .
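The influence quantities above can be estimated empirically by sampling coordinate flips on the uniform cube. The sketch below uses squared differences (an ℓ2-style influence); the exact norm in Definition 4 was elided in the extracted text, so this is illustrative only.

```python
import random

def estimate_influences(f, n, m=500, seed=0):
    """Monte-Carlo estimate of per-coordinate influences of f : {0,1}^n -> R
    over the uniform distribution, using squared coordinate-flip differences."""
    rng = random.Random(seed)
    inf = [0.0] * n
    for _ in range(m):
        x = [rng.randint(0, 1) for _ in range(n)]
        fx = f(x)
        for i in range(n):
            x[i] ^= 1                    # flip coordinate i
            inf[i] += (fx - f(x)) ** 2   # squared difference contributes to influence of i
            x[i] ^= 1                    # flip it back
    return [v / m for v in inf]

# Dictator function f(x) = x_0: only coordinate 0 has influence (equal to 1 here).
infl = estimate_influences(lambda x: x[0], n=4)
```

The total influence is the sum of these per-coordinate values; Lemma 7 bounds this sum for -self-bounding functions.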

We now show that every -self-bounding function has total influence at most . The proof follows Lemma 4.2 in [11].

Lemma 7.

Let function be an -self-bounding function, then .

Proof.

By the definition of total -influence, we have

(6)
(7)

where (7) holds because each difference is counted twice in (6) but only once in (7).

Further, we notice that

By using the property of -self-bounding functions, we can then upper bound by

The following structural result has been proved in [10].

Theorem 8 ([10]).

Let be a function and . For every , let and for . Then , and there exists a polynomial of degree over the variables in I such that .

Theorem 3 can thus be obtained by plugging in the total influence of -self-bounding functions .

The learnability result in Theorem 4 is a simple extension of Theorem 1.2 in [10] and Theorem 7.5 in [11]. Namely, since our goal is to learn an -self-bounding function, there is no inductive bias if we choose the hypothesis class to be all functions with low total influence. As in [11] and [10], the learning algorithm used to prove the result is polynomial regression over all monomials of degree .

Theorem 9 (Extension of Theorem 1.2 in [10]).

Let be the class of all functions from to with total influence at most . There exists an algorithm that, given and access to random uniform examples of any target function , with probability at least outputs a hypothesis s.t. . Further, runs in time and uses examples.

Theorem 4 immediately follows by substituting the bound on the influence of -self-bounding functions into the above theorem.

Appendix D Experiment Details and Additional Results

d.1 Implementation Details

For the error simulation experiment in Section 6.2, we use a small MLP model with 2 hidden layers as the utility learning model, where the numbers of neurons in the hidden layers are 20 and 10, respectively. For the data point and group removal experiments, we use MLP models with 3 hidden layers as the utility learning model. Each fully-connected layer uses LeakyReLU as the activation function and is regularized by Dropout [25]. We use the Adam optimizer with learning rate and mini-batch size 32 to train all of the utility models mentioned above for up to 800 epochs.

For fair comparison, we always fix the same training budget across baselines. For Group Testing, we use half of the training budget to estimate the Shapley value of the last data point and the other half to estimate the differences in Shapley value between data points. We use the CVXOPT library (https://cvxopt.org/) to solve the constrained minimization problem in the Least core approximation. We set the degree of CGA to 2 in the experiments, and use the SGD optimizer with learning rate and batch size 32 to train the CGA model.
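A minimal sketch of such a utility-learning model is shown below, using scikit-learn's MLPRegressor as a stand-in for the paper's MLP. The binary subset-membership encoding, the hyperparameters, and the toy diminishing-returns utility are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_utility_model(utility_samples, n, hidden=(20, 10), seed=0):
    """Fit an MLP mapping binary subset-membership indicators to observed utilities."""
    X = np.array([[1.0 if i in S else 0.0 for i in range(n)] for S, _ in utility_samples])
    y = np.array([u for _, u in utility_samples])
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=seed)
    model.fit(X, y)
    return model

# Toy ground-truth utility with diminishing returns: u(S) = sqrt(|S|) / sqrt(n).
rng = np.random.default_rng(0)
n = 10
subsets = [frozenset(np.flatnonzero(rng.random(n) < 0.5)) for _ in range(300)]
samples = [(S, np.sqrt(len(S)) / np.sqrt(n)) for S in subsets]
model = fit_utility_model(samples, n)
pred = model.predict(np.ones((1, n)))  # predicted utility of the full dataset
```

Once fitted, the model replaces expensive retraining when the valuation heuristics query the utility of new subsets.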

Figure 4: Approximation error for Shapley and Least core values as the number of samples changes. (a), (b), (e), (f): synthetic dataset with 10 data points; (c), (d), (g), (h): Iris dataset with 15 data points. DUL stands for data utility learning.
Figure 5: Curves of model test performance when the best/worst data groups ranked according to Shapley (SV) or Least core (LC) estimations are removed. The upper row ((a), (b)) removes the best data groups. The steeper the drop, the better. The bottom row ((c), (d)) removes the worst data groups. The sharper the rise, the better. DUL stands for data utility learning.

d.2 Additional Results

For the error simulation experiment, we show estimation errors in Figure 4. For these two less stringent error metrics, we can see that data utility learning still greatly reduces the Shapley/Least core estimation error with a relatively small number of utility samples.

We also show additional results for the group data removal task with training budgets and in Figure 5. Similar to the case of , the Shapley/Least core estimations boosted by data utility learning still significantly outperform the vanilla ones for larger budgets.