1 Introduction
The last decade has seen remarkable progress in computer vision, as deep learning has pushed the state of the art on nearly every vision task in the field. The majority of this success can be attributed to supervised learning, typically leveraging a significant number of labeled training examples. Increasing the number of accurately labeled data points tends to improve model performance, as it enables models to better approximate the true distribution of the data. The critical drawback is the time and cost of human annotation, particularly in domains requiring massive training datasets (e.g. generalized object classification such as ImageNet russakovsky2015imagenet) or in which the cost of a label is high (e.g. medicine, where expensive expertise is required ghorbani2020deep; esteva2017dermatologist). Techniques that maximize how much models learn from a given training dataset have the potential to improve model performance on the task at hand, a feature of particular importance in data-limited and data-expensive regimes.
The field of active learning focuses on finding optimal ways of labeling data. This is achieved by coupling a learning algorithm (e.g. a deep model) with an active learner. The learning algorithm iteratively trains on a labeled batch of data, then queries the active learner, which selects the next best set of points to be labeled and fed into the learning algorithm. This process enables both more efficient learning and equivalently expressive models trained on smaller labeled datasets, thus reducing the overall cost of annotation. To date, the field has primarily focused outside of deep learning, on models with lower expressivity and a lower computational training cost settles2009active. Most of these methods involve labeling a single example at a time and then retraining the model entirely from scratch, which is intractable for large-scale convolutional neural networks (CNNs) and other deep models. As a result, recent advances in active learning have been in the large-batch setting, in which at each iteration of the algorithm a large batch of new examples is labeled ren2020survey.
Large-batch active learning methods work as follows. First, they compute a heuristic score of the utility of each potential unlabeled data point. Then, they rank the data points based on this score and choose the next batch for labeling. This score is either information-based (how much new information the point will add) or representation-based (what portion of the data distribution the point represents). In choosing the next batch, the utility scores are leveraged to choose a batch that is overall high-ranking while being diverse, so that the model can extract meaningful signal from the data. Simply selecting the top-ranking points is typically not an effective strategy, as the highest-ranking points are likely to be very similar.
Here we demonstrate that data valuation using the Shapley value of data can filter out most of the unlabeled data (e.g. the bottom 90%) that is fed into diversity methods. The effect is to significantly increase their efficiency without impacting their performance, thus allowing them to scale more easily to larger datasets. We build our utility score around a recently introduced notion in machine learning, Data Shapley ghorbani2019data, which computes the contributed value of each data point for a given task. The core intuition behind our approach is that the incremental value of a training data point to a model varies according to the model's performance on the task at hand. We design a new algorithm based on this notion, named Active Data Shapley (ADS), which directly optimizes batch selection to enhance the model's final performance. As a result, ADS performs well even when learning from imperfect and noisy data, which is common in practice.
Our contributions
In this paper, we propose using data valuation to improve batch active learning and introduce a new algorithm, ADS, which enhances diversity-based active learning techniques to make them more efficient while maintaining and even improving performance. We provide comprehensive empirical results to demonstrate that ADS is a reliable method in practice. Our results consider two types of experiments. In the first, we compare diversity techniques against their ADS-enhanced versions using standard curated datasets (CIFAR-10 krizhevsky2009learning, CINIC-10 darlow2018cinic, Tiny ImageNet), and demonstrate that ADS increases their efficiency by about 6.4x while preserving the quality of the resultant dataset, as judged by downstream model performance. In the second, we consider more realistic active learning scenarios in which the unlabeled data comes from noisy, heterogeneous, and domain-shifted distributions that are not perfectly matched with the labeled data's distribution. In these challenging settings, we show that ADS improves both the efficiency and effectiveness of the underlying diversity method.
2 Related Work
This work is related to two families of previous efforts: active learning and data valuation.
Data Valuation
For a model trained on a set of points, data valuation refers to estimating each point's contribution to the model's performance. Traditional methods like Cook's distance cook1977detection and its more recent approximations like influence functions koh2017understanding define the importance of a point by the drop in the model's performance after removing the point from the training set. The Shapley value of data ghorbani2019data; jia2019towards, a notion which satisfies the requisite equitability Shapley axioms shapley1953value, defines a more robust measure of a data point's contribution. The definition of the distributional Shapley value ghorbani2020distributional made it theoretically and empirically possible to estimate the value of new points outside the training set by interpolating the values of existing points. The major drawback of the original Data Shapley method was its computational cost: it did not scale to large datasets. The method introduced in jia2019efficient made it possible to efficiently compute exact Data Shapley values for a KNN model, enabling its use in this work.
Active Learning
Directly solving the active learning optimization problem (see Equation 1) is computationally infeasible. Existing methods propose heuristics for choosing the best B points to label. The idea is that data points in the unlabeled pool will provide different amounts of performance gain. The first-order approach is to assign a utility score to each point and then choose the top B points. There are two main families of methods for finding this measure of utility.
The first family of methods focuses on informativeness: how much additional information each new data point can provide. For a classification problem, the output of a model is usually a probability vector over all the classes, and the simplest approaches build the utility measure from this probability vector. One natural approach is using the entropy as a measure of prediction uncertainty shannon1948bell; if the model is uncertain in its prediction, the data point will provide information not already stored. Another approach is using the margin, the difference in prediction confidence between the predicted class and the second most probable class scheffer2001active; wang2011active; chen2013near. More complex methods train a number of classifiers on the unlabeled data and use the disagreement amongst them to choose the most informative points vasisht2014active; melville2004diverse; guyon2011results (query by committee). More recent approaches are Bayesian and usually focus on approximating the parameter posterior houlsby2011bayesian.
The second family of methods uses representativeness: choosing a subset of points that are representative examples of a large number of data points in the unlabeled pool. One natural approach is based on clustering li2012active; ienco2013clustering: after clustering is performed, the cluster centers serve as good representative examples of the points in each cluster. There has also been a line of work focused on combining the two families du2015exploring. For the specific case of batch active learning for deep neural networks, recent uncertainty-based methods like BatchBALD kirsch2019batchbald use the mutual information between model parameters and the data points for selection. A different work ducoffe2018adversarial approximates the distance of each point to the decision boundary (by running adversarial attacks on the model) and selects the points closest to the boundary. Representative selection methods such as Coreset selection sener2017active seek to select the most representative subset of data points. Both families can suffer from the diversity problem in the large-batch setting: selecting the most uncertain points can result in choosing similar points close to the decision boundary, and selecting the most representative points can result in only sampling similar points from a high-density area of the distribution. As a result, batch active learning methods that seek to sample a diverse batch of points have been developed zhdanov2019diverse; wei2015submodularity; hoi2006batch. All of the existing methods, however, use measures like uncertainty as a proxy for estimating how much a point can improve the trained classifier's performance. Unlike existing methods that use informativeness or representativeness as surrogate measures of a point's expected contribution, we estimate the Data Shapley value of points, which, by definition, is a measure of a point's expected contribution to a classifier's performance.
3 Methodology
Notation
We focus on classification problems where the prediction's output space is $\mathcal{Y} = \{1, \dots, C\}$. Let $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{test}}$ denote the true distributions of training and test data points over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{Z}$ is the space of data points. We are given a set of $n$ samples from the training data distribution which we refer to as the training set $D$. For simplicity, we use data points and their indices interchangeably, e.g. $D = \{1, \dots, n\}$ for the training set. In this work, we specifically consider subsets of the training data, that is, $S \subseteq D$. A learning algorithm $\mathcal{A}$ is a black box that takes in a subset of data and outputs a predictive model. In our work, applying $\mathcal{A}$ is equivalent to training a deep convolutional neural network using stochastic gradient descent. Active learning optimizes over a chosen performance metric (e.g. 0-1 loss). For simplicity, we use $V(S)$ to refer to the test performance of a model trained on subset $S$, and define $V(\emptyset) = 0$. There exists a labelling function $\ell$ that returns each example's true label: $y = \ell(x)$.
Active Learning
Improving an existing model's performance requires new labeled training data. Active learning seeks to find and label the smallest number of points to achieve the greatest performance improvement. The setup is as follows: we begin with a given subset $L_0$ of labeled data points, which we call the "initial pool". At each step of the algorithm, we have a labelling budget $B$ (the number of times we can query the labelling function) to label new data points. The first step can be formally written as the following optimization problem:
(1)  $\underset{S \subseteq D \setminus L_0,\; |S| = B}{\arg\max} \; V(L_0 \cup S)$
The goal here is to label a size-$B$ subset of the unlabeled data that results in the maximum increase in the trained model's performance on the test data. The goal at each subsequent step is similar: the active learning algorithm uses the previously labeled batches to select $B$ new points for labelling from the remaining unlabeled data points. True test performance is approximated using a subset of data points sampled from the test distribution. From the equation, it is clear that the problem of batch active learning is difficult on two fronts: (1) we have to estimate the resulting performance without having the labels, and (2) the search space is combinatorially large.
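To make the overall procedure concrete, the following is a minimal sketch of the generic batch active learning loop described above. It is not the paper's implementation; `train_fn`, `score_fn`, and `label_fn` are hypothetical placeholders for the learning algorithm, the active learner's utility score, and the labelling function.

```python
import numpy as np

def active_learning_loop(train_fn, score_fn, label_fn, labeled, unlabeled, budget, steps):
    """Generic batch active learning loop: train on the labeled pool,
    score the unlabeled pool, label the top-B candidates, repeat."""
    for _ in range(steps):
        model = train_fn(labeled)                 # retrain from scratch each step
        scores = score_fn(model, unlabeled)       # utility score per candidate
        top = np.argsort(scores)[::-1][:budget]   # indices of the B best candidates
        labeled = labeled + [(unlabeled[i], label_fn(unlabeled[i])) for i in top]
        keep = set(range(len(unlabeled))) - set(top.tolist())
        unlabeled = [unlabeled[i] for i in sorted(keep)]
    return labeled
```

Note that this greedy top-B selection is exactly the first-order heuristic that the diversity methods discussed later are designed to improve upon.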
Data Valuation
Data valuation seeks to assign a value to each individual training datum. Given a model trained on the training dataset $D$ and its test performance $V(D)$, valuation seeks to determine how much each data point contributed to this performance. In other words, it asks how to divide $V(D)$ among the individual data points: for each data point $i$, we seek a value $\phi_i$ such that $\sum_{i \in D} \phi_i = V(D)$. Introduced in ghorbani2019data, Data Shapley provides a solution that satisfies the Shapley equitability axioms:

Null element: If a data point $i$ results in zero change in performance when added to any subset of $D \setminus \{i\}$, it should be given zero value.

Symmetry: If two distinct data points $i$ and $j$ result in the same change in performance when added to any subset of $D \setminus \{i, j\}$, they should have equal value.

Linearity: If the performance metric is a linear combination of individual metrics, the values should follow the same linearity.
The Data Shapley value uniquely satisfies the above axioms. For a data point $i$, let $\phi_i$ be its Data Shapley value. We have:
(2)  $\phi_i = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left[ V(S \cup \{i\}) - V(S) \right]$
That is, the Data Shapley value of a datum is a weighted average of its marginal contribution when added to all possible subsets of the training data, where the weight corrects for the number of subsets of the same size. It is a measure of the utility of each datum: a data point with a highly positive Shapley value is one that contributes positively to most subsets of the data, while a data point with a negative Shapley value is one that on average hurts the performance of the model. The above definition, however, is restricted to the case where the training dataset is fixed. In machine learning, we usually assume the training data is an i.i.d. realization of an underlying data distribution. Under this assumption, one can extend the idea of the Data Shapley value to the distributional Shapley value ghorbani2020distributional. It has been shown, theoretically and empirically, that this value is Lipschitz continuous for a large group of learning algorithms: if two data points are similar (in a proper metric space), their values will be similar. This is very useful in practice; if we have the Shapley values for a subset of data points, we can estimate the values of the remaining points by interpolating the existing values. Concretely, once we have the values of the labeled points, we can train a regression model to predict the value of a new data point given its covariates and label. This is important in our setting, as we can only compute Data Shapley values for the labeled pool. We then use these values to train a regression model that predicts an unseen point's value, and applying the model to the unlabeled pool gives an estimate of the value of each unlabeled point.
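The interpolation step above can be sketched as follows. This is a minimal illustration assuming a plain Euclidean KNN regressor over learned representations; the specific regressor is an implementation choice, not something the text pins down.

```python
import numpy as np

def knn_value_regression(labeled_reps, labeled_values, query_reps, k=5):
    """Estimate the Shapley value of unseen points by averaging the values of
    their k nearest labeled neighbors (relying on Lipschitz continuity of values)."""
    preds = []
    for q in query_reps:
        d = np.linalg.norm(labeled_reps - q, axis=1)  # Euclidean distance to labeled points
        nn = np.argsort(d)[:k]                        # k closest labeled points
        preds.append(labeled_values[nn].mean())       # interpolate their values
    return np.array(preds)
```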
Batch Active Learning with Shapley Data Valuation
As mentioned above, batch active learning seeks to find the subset of points that results in the largest improvement in performance. From Eq. 2, it can be observed that the Data Shapley value of a point is a robust indicator of its contribution to model performance, since it is an exhaustive average of the point's behavior across subsets. Assume we are given the true Data Shapley values of all the unlabeled data points. One way of approaching the optimization problem in Eq. 1 is to reduce the search space by searching only over a high-value subset of the unlabeled pool. Fig. 3 shows a graphical example of active learning and how this can help. Here, we have a binary classification problem with a logistic regression model in which the red class is a mixture of two distributions: a majority group and a minority group (Fig. 3(a)). The points chosen by two simple active learning algorithms, one based on uncertainty and one based on representativeness, are shown in Fig. 3(c, d). Either method results in poor performance, since in both cases performance is hindered by the minority group in the red class. Points in this minority cluster have low Data Shapley values because their presence actually hurts the accuracy of a simple linear classifier (Fig. 3(e)). Removing the low-value points from the search space first and then sampling a representative set of examples greatly improves performance (Fig. 3(f)).
Approximating Data Shapley Values
From Eq. 2 we can see that exact computation of Data Shapley values is infeasible in most realistic scenarios. One way to approximate them is the TMC-Shapley algorithm introduced in the original Data Shapley work ghorbani2019data; jia2019towards. The idea is that an equivalent form of Eq. 2 is:
(3)  $\phi_i = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \left[ V(S_\pi^i \cup \{i\}) - V(S_\pi^i) \right]$
where $\Pi$ is the set of all permutations of $D$ and $S_\pi^i$ is the set of points that appear before $i$ in permutation $\pi$. Using this form, one can use Monte Carlo sampling to estimate $\phi_i$ as follows: sample a permutation of $D$, go over the sampled permutation one point at a time, train a new model on all the observed points, and truncate once adding a new point results in a small marginal change. This yields one Monte Carlo sample of the value of every data point in $D$. Iterating this process, one can obtain an arbitrarily accurate approximation of the Data Shapley values.
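The truncated Monte Carlo procedure just described can be sketched as follows. This is an illustrative sketch, not the authors' code; `perf_fn` is a hypothetical black box returning the test performance of a model trained on a given subset of point indices.

```python
import numpy as np

def tmc_shapley(n, perf_fn, iters=100, tol=1e-3, seed=0):
    """Truncated Monte Carlo (TMC) estimate of Data Shapley values for n points.
    perf_fn(subset) returns the test performance of a model trained on `subset`
    (a list of point indices); perf_fn([]) is the empty-model performance."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    full_perf = perf_fn(list(range(n)))
    for t in range(1, iters + 1):
        perm = rng.permutation(n)                 # one sampled permutation
        prev = perf_fn([])
        for j, i in enumerate(perm):
            if abs(full_perf - prev) < tol:
                marginal = 0.0                    # truncate: remaining gains negligible
            else:
                cur = perf_fn(list(perm[: j + 1]))
                marginal = cur - prev
                prev = cur
            # each permutation yields one marginal sample per point: running mean
            values[int(i)] += (marginal - values[int(i)]) / t
    return values
```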
Although the TMC-Shapley algorithm has been shown to be efficient for simple models ghorbani2019data; jia2019towards; ghorbani2020distributional, it is infeasible for a deep neural network: it requires retraining the model a prohibitively large number of times. Fortunately, existing work has addressed this problem jia2019empirical: for a K-Nearest-Neighbor (KNN) model, exact Shapley values can be computed efficiently using dynamic programming. This idea can be utilized in deep learning as follows. We consider the value of points not in their original space (e.g. pixels), but in a learned representation space: we use the pre-logit layer of a trained deep neural network as a good representation space for the classification problem at hand. In this space, training a simple model (like a KNN classifier) results in accuracy comparable to that of the original model. We then compute the Data Shapley values using the aforementioned method. Note that, by doing so, we ignore each data point's contribution to the representation learning part of the model and focus only on its contribution to the prediction. In this work, we use a KNN classifier on top of the learned representation for two reasons: (1) in practice, applying a KNN model on top of a deep network's pre-logit layer achieves accuracy similar to the model's own; and (2) we can use the efficient method introduced in jia2019efficient, which computes exact Shapley values in linear time.
Active Data Shapley algorithm
Looking at Eq. 2, there is an immediate problem: we do not have labels for points outside the initial labeled pool. As mentioned above, we can use the Lipschitz continuity property of data values: we compute the Data Shapley values of points in the labeled pool and train a regression model to predict the values of points in the unlabeled pool. Fig. 2 visually describes the algorithm. First, we train the deep learning model using the labeled pool, yielding a good representation extractor. We then pass all of the labeled data points through the model to extract their representations and apply the KNN-Shapley algorithm to compute their exact values. For a problem with $C$ classes, we use $C$ regression models (e.g. KNN regression), each using the labeled data points of one class, to predict Shapley values of unlabeled data points. That is, for each unlabeled point $x$, we predict the Shapley value of $(x, y)$ for each possible choice of label $y$, yielding $C$ candidate values. The remaining step is to aggregate these into a single value for $x$. Possible aggregations include taking the average, or a weighted average using the model's prediction probability for each class. In practice, we find that the optimistic approach of taking the maximum value results in the best performance. When $C$ is large, this step can be computationally expensive; we address this by limiting the number of possible classes considered for each unlabeled point, i.e. we only consider labels $y$ for which the model's confidence is large enough (the top 10 classes in our experiments).
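The aggregation step above can be sketched as follows. This is a simplified illustration of the optimistic max-aggregation with optional confidence-based filtering; the per-class regressors that produce `per_class_values` are omitted, and the function name and signature are hypothetical.

```python
import numpy as np

def aggregate_unlabeled_values(per_class_values, probs=None, top_k=None):
    """Aggregate per-class predicted Shapley values into one score per point.
    per_class_values: (n_points, n_classes) predicted value of (x, y) per label y.
    If probs and top_k are given, only the top_k most confident classes per
    point are considered before taking the max."""
    vals = np.asarray(per_class_values, dtype=float)
    if probs is not None and top_k is not None:
        mask = np.full(vals.shape, -np.inf)
        idx = np.argsort(probs, axis=1)[:, -top_k:]   # top_k classes by confidence
        rows = np.arange(vals.shape[0])[:, None]
        mask[rows, idx] = vals[rows, idx]             # keep values only for those classes
        vals = mask
    return vals.max(axis=1)                           # optimistic aggregation
```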
Batch diversity
Choosing the highest-scoring points to form a batch can result in a batch that is not diverse enough. Consider an adversarial scenario in which each data point is repeated many times in the unlabeled pool: choosing the highest-scoring points would result in choosing repetitions of the same high-value examples. To tackle this issue, one can pre-select a larger set of points based on their predicted Shapley values and then choose a diverse subset from it. To this end, we first pre-select a larger number of data points (2 to 10 times the batch size, depending on the size of the unlabeled pool) and then use a diversity selection algorithm (e.g. Coreset sener2017active) to select a diverse set of points.
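The two-stage selection above can be sketched as follows: pre-select by predicted value, then run greedy k-center (the greedy Coreset heuristic) on the survivors. This is an illustrative sketch; the pre-selection factor and seeding choice are assumptions, not details fixed by the text.

```python
import numpy as np

def ads_batch_selection(reps, predicted_values, batch_size, preselect_factor=5):
    """Pre-select the highest-value points, then pick a diverse batch with
    greedy k-center on their representations."""
    m = min(len(reps), preselect_factor * batch_size)
    pool = np.argsort(predicted_values)[::-1][:m]     # keep only high-value points
    X = reps[pool]
    chosen = [0]                                      # seed with the top-valued point
    dists = np.linalg.norm(X - X[0], axis=1)          # distance to nearest chosen point
    while len(chosen) < batch_size:
        nxt = int(np.argmax(dists))                   # farthest point from current batch
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return pool[chosen]                               # indices into the original pool
```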
4 Experiments
Following previous work sener2017active; ash2019deep, we focus on the task of image classification and consider both the efficiency and the effectiveness of ADS-enhanced active learning. In all of our experiments, we use a WideResNet model (WideResNet-16-8 for character classification tasks and WideResNet-24-10 for all other experiments). We assume 5,000 images of the dataset are given as the initial labeled pool, and at each iteration we label a new batch of 5,000. We first report efficiency results for the first iteration of the active learning problem using several diversity algorithms: (1) Coreset selection sener2017active, using the greedy version of the algorithm for computational efficiency; (2) KMedians; and (3) Batch Active Learning by Diverse Gradient Embeddings (BADGE) ash2019deep. Due to computational constraints, for the iterative selection experiments we focus on three active learning algorithms: (1) Coreset selection; (2) Entropy, in which the highest-entropy data points are chosen; and (3) Random, in which examples are selected randomly.
Following gal2017deep, at each iteration of the active learning acquisition we retrain the model from scratch, as this helps performance. In all experiments, a small set of 500 test examples is used as our validation set for early stopping; the same validation set is used for approximating the Data Shapley values. Results are reported on the rest of the test set.
KMedians     | 63.4 | 94.9 | 64.5 | 34.2 | 85.4
KMedians+ADS | 63.3 | 95.5 | 65.2 | 34.1 | 85.4
Coreset      | 63.1 | 93.1 | 64.7 | 34.3 | 86.2
Coreset+ADS  | 63.4 | 93.4 | 64.7 | 34.4 | 86.4
BADGE        | 63.9 | 95.0 | 70.4 |  –   | 87.8
BADGE+ADS    | 64.3 | 95.4 | 70.4 |  –   | 87.2
4.1 Active Learning Efficiency
In Fig. 4, we compare the time-efficiency of Coreset, BADGE, and KMedians to their ADS-enhanced versions. These methods scale linearly with the size of the unlabeled pool. ADS-enhanced diversity applies the diversity method on a pre-selected subset of high-value unlabeled data points, much smaller than the original unlabeled pool, and can therefore linearly decrease the computational cost. For instance, if ADS removes 90% of the unlabeled pool in the pre-selection step, the diversity method becomes about 10 times faster, since the computational cost of the KNN-Shapley algorithm is small. The larger the unlabeled pool, the greater the efficiency gain. In all cases considered, ADS-enhancement yields an efficiency gain factor of 2.6x to 8x. Note that the reported times for ADS include the time it takes to regress and predict the Shapley values. Table 1 shows that despite the speed gain, adding ADS to the active learning pipeline can also improve overall performance.
4.2 Active Learning Effectiveness on Curated Data
We consider the effect of ADS-enhancement when labeling both curated and noisy data. In the curated case, we use standard benchmark datasets in which the unlabeled pool and the test set come from the same distribution. All methods on curated data converge to similar performance levels; here, ADS-enhanced methods either match or slightly outperform their counterparts, as described below and illustrated in Fig. 5 (left column).
We use three benchmark datasets to show variations of the problem. The first is CIFAR-10 krizhevsky2009learning, containing 50,000 tiny ($32 \times 32$) colored images of 10 object classes. The second is CINIC-10 darlow2018cinic, an interesting variant of CIFAR-10 that contains the same 10 classes but draws from two different sources: 50,000 of its images come from CIFAR-10 and 200,000 come from ImageNet (ILSVRC2012) russakovsky2015imagenet, from the same 10 classes. It is chosen in order to better understand how active learning methods behave when the unlabeled pool is much larger and the data is more diverse. The third is Tiny ImageNet, chosen to investigate the scenario of having a larger number of classes; it has 100,000 images from 200 object classes. We set the pre-selection threshold according to the size of the unlabeled pool. For CIFAR-10, since the pool is small, we pre-select a fraction of the points with the highest Shapley values and then apply Coreset at each iteration of batch active learning. For CINIC-10 and Tiny ImageNet, we apply Coreset after pre-selecting a smaller fraction of the unlabeled points with the highest values. Note that this could be replicated with any other diversity method.
Experimental results are shown in Fig. 5 (left column), where we compare performance across the various methods. We focus on the first five iterations of each algorithm for clarity. In all three datasets, ADS-enhanced Coreset essentially matches the performance of Coreset alone, and both techniques outperform the two baselines. This shows that ADS can enhance a diversity method without its data reduction step negatively impacting model performance on curated data.
4.3 Active Learning Effectiveness on Noisy Data
The assumption that the unlabeled pool comes from the exact distribution of the labeled pool is unrealistic in real-world setups. Often the unlabeled data is not clean: images can be corrupted by various confounders like noise and distortion. In other cases, the unlabeled data comes from various sources, and it is unknown how closely each source matches the test data distribution. Further, the unlabeled data may be collected quickly and without expensive curation. We can extract value from a cheap, low-quality unlabeled pool by intelligently selecting its best data points. In the following set of experiments, we simulate such real-world scenarios to evaluate the different learning algorithms.
As in the curated data experiments, we use Coreset as the diversity method and work with both CIFAR-10 and CINIC-10. Additionally, we use the Street View House Numbers (SVHN) dataset, which contains more than 70,000 colored images for the task of digit classification, along with an extra set of over 500,000 images that we use to mimic the realistic scenario of having a large unlabeled pool.
Further, we create a web-scraped dataset, Cheap10, designed to investigate the real-world setting of gathering a large pool of unlabeled data points quickly and inexpensively. To this end, we use the Bing image search engine and search the title of each class in CIFAR-10 with reasonable keywords (e.g. "convertible car") to scrape the web. We create a dataset of 500,000 images (ten times the size of CIFAR-10) in just a few hours of effort. The dataset, while containing many valid images, also contains many out-of-distribution, noisy, and mislabeled examples. Given the large size of the unlabeled pool in both datasets, we pre-select only a small fraction of the unlabeled pool using the estimated Shapley values before choosing a batch of 5,000 using Coreset selection.
Using the above three datasets, we run three experiments to examine different aspects of noisy data scenarios. The results are shown in Fig. 5 (right column).
The first experiment investigates domain shift. CINIC-10 serves as the unlabeled pool, while performance is measured on CIFAR-10. Since CINIC-10 is a mix of CIFAR-10 and ImageNet, one can assume that some of the ImageNet images are in-domain while others are not (e.g. a panther image for the cat class). ADS-enhanced active learning achieves top performance over the baselines, since it can find the points that are most helpful in classifying CIFAR-10 images.
The second experiment models the case in which the given unlabeled data is partially corrupted. We use SVHN's extra training set as the unlabeled pool and corrupt a portion of the images with white noise. The power of the noise is randomly sampled from a Beta distribution to simulate the real-world scenario of having images of varied quality. All active learning methods perform better than random selection, and ADS-enhanced active learning achieves the best performance.
Our final experiment represents a common real-world scenario: gathering curated data is expensive and time-consuming, whereas low-quality unlabeled data is abundantly available. The Cheap10 dataset contains the CIFAR-10 classes but is corrupted by a large number of out-of-task and out-of-distribution images (e.g. a car brand logo for the car class). Here, ADS-enhancement significantly outperforms the other active learning methods, though random selection also performs surprisingly well.
This set of experiments shows that ADS provides a degree of robustness to real-world distributional shifts during active learning, matching and even enhancing the performance of diversity methods. Considering both the curated and noisy data experiments, we conclude that using the Shapley value of data as a metric helps find the best images for the task at hand while significantly boosting the efficiency of active learning.
5 Discussion
We propose Active Data Shapley as a new enhancement step for active learning, of particular value in the context of deep learning, which requires very large datasets. Historically, active learning methods have chosen the next best data point from an unlabeled pool to be labeled. In the context of deep learning, batch active learning is employed as a more scalable alternative, in which the next best batch of data is selected from the unlabeled pool. Standard batch active learning methods assign a utility score to unlabeled data points and label a group of highly scored points. Unlike prior methods, ADS takes a new approach to computing the utility scores, built on the notion of the Shapley value of data: the Data Shapley value is a direct measure of how much a point will help the prediction accuracy of an existing model if it joins the existing training data. Through extensive experimentation, we show that ADS enhances existing active learning techniques by making them more efficient and more robust to noisy unlabeled data. We perform experiments that model real-world scenarios in which unlabeled data points are intentionally misaligned with the test distribution, and show that our method significantly enhances existing techniques. Of note, ADS is a general framework which can be extended to other learning models, and future research may apply ADS beyond deep learning. Additionally, the ADS pipeline depends on the quality of the value estimation; if we are unable to properly estimate unlabeled points' values, the method's performance will suffer.