Preventing Adversarial Use of Datasets through Fair Core-Set Construction

October 24, 2019 · Benjamin Spector et al.

We propose improving the privacy properties of a dataset by publishing only a strategically chosen "core-set" of the data containing a subset of the instances. The core-set allows strong performance on primary tasks, but forces poor performance on unwanted tasks. We give methods for both linear models and neural networks and demonstrate their efficacy on synthetic and real data.







1 Introduction

The advent of deep learning has led the field of computer science to a difficult crossroads. On the one hand, in pursuit of performance, state-of-the-art methods require ever-larger mountains of data. On the other, users have become more conscious and protective of their data. Much of this concern stems from fear of what could be done with data: most people are in favor of more relevant search results and recommendations, but many worry about malicious use of their personal data.

Most current works take approaches related to federated learning or differential privacy (Dwork, 2011; Bonawitz et al., 2019; Geyer et al., 2017). In this work, we take an alternative approach, most similar to Schmidt et al. (2018), by reducing a dataset to a much smaller subset which is most applicable to the task at hand and least applicable to other tasks.

Our framework operates as follows. First we train one or more models for the entire dataset; we treat these models as approximations of the target function, constrained to lie within the model family. Second, we phrase an optimization problem to select a core-set that supports reconstruction of the approximate target while simultaneously optimizing for desirable secondary criteria. In this abstract, we focus on the secondary criterion of preserving privacy of specified target attributes, but this applies more broadly.

The benefits of a core-set approach to privacy are threefold. First, by storing much smaller amounts of data, the potential downside of the training data being compromised is significantly lessened. (Incidentally, this also has many auxiliary advantages, such as lowered costs and footprint for data storage, and more rapid training.) Second, it is also possible to construct core-sets that increase the difficulty of solving problems other than the specific problem of interest, lessening privacy concerns. Third, it is simple and practical, because all it requires is deletion of certain data elements.

2 Background

Core-sets are popular in classical machine learning and data mining, typically for increasing the speed of algorithms (Har-Peled & Kushal, 2007; Tsang et al., 2005). However, they remain relatively unexplored for both neural networks and other applications.

Given a training set of examples X = {x_1, …, x_n} and corresponding labels y = {y_1, …, y_n}, we train a model f to predict labels on unseen examples. We are provided a model evaluator E (typically based on a testing set) that maps a model to a scalar performance. The core-set problem is to find a set of indices I ⊆ {1, …, n}, inducing a subset X_I of the examples and y_I of the labels, that maximizes E(f) for the model f trained on (X_I, y_I). Analogously, we define the k-core-set problem as the same problem with the constraint |I| = k. Throughout the paper, k will denote the core-set size.
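The definition above can be made concrete with a minimal sketch (our own illustration, not code from the paper), in which the evaluator E is negative held-out mean-squared error and, for a tiny n, the k-core-set problem is solved by exhaustive search over index sets:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a 1-D linear relationship with a little noise.
X = rng.normal(size=(8, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=8)
X_test = rng.normal(size=(50, 1))
y_test = 2.0 * X_test[:, 0]

def train(X_sub, y_sub):
    # Least-squares fit (slope only, no intercept, for brevity).
    return np.linalg.lstsq(X_sub, y_sub, rcond=None)[0]

def evaluate(w):
    # Evaluator E: negative test MSE, so higher is better.
    return -np.mean((X_test @ w - y_test) ** 2)

k = 3
best_I, best_score = None, -np.inf
for I in itertools.combinations(range(len(X)), k):
    I = list(I)
    score = evaluate(train(X[I], y[I]))
    if score > best_score:
        best_I, best_score = I, score

print(best_I, best_score)
```

Exhaustive search is only feasible for toy sizes; the methods in the following sections replace it with per-example scoring heuristics.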

3 Methods

One might reasonably wonder whether core-sets could ever perform significantly better than random samples. As a trivial example, consider the problem of estimating the mean and variance of a normal distribution N(μ, σ²) from n samples. As the samples are unbiased, our best-estimate mean should be the mean of the samples, an unbiased estimator with expected squared error σ²/n. As n grows very large we will have an excellent estimate of μ. Likewise, standard techniques allow us to estimate the variance of the underlying distribution. Given n samples, we may for example select a core-set of k elements that preserves the sample mean, outperforming a random sample of size k. We may also carefully select points for this core-set with greater dispersion than a random sample would have, obscuring the variance of the distribution. Such a core-set would be appropriate to publish if the goal is to allow learning of the mean while protecting the variance.
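This construction is easy to sketch (our own illustration): draw samples from a normal distribution, then build the core-set from symmetric pairs of extreme points, which approximately preserves the sample mean while greatly inflating the apparent variance.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 10_000, 100
x = rng.normal(loc=5.0, scale=1.0, size=n)

# Take the k/2 smallest and k/2 largest samples. Because the tails of a
# (roughly symmetric) sample balance each other, the subset mean stays
# close to the full-sample mean, while the subset variance is inflated.
order = np.argsort(x)
pairs = k // 2
idx = np.concatenate([order[:pairs], order[-pairs:]])
core = x[idx]

print(core.mean(), x.mean())   # close to each other
print(core.var(), x.var())     # core variance much larger
```

A learner receiving only `core` would estimate the mean well but badly overestimate the variance, which is exactly the asymmetry between tasks the paper is after.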

Thus, by condensing the greater information of the full sample into a small number of examples through careful choice, we may produce a (non-representative) sample that nonetheless achieves superior performance on our evaluation metric. Furthermore, given this great freedom, we may also select unrepresentative samples that disguise other information about the true distribution.

What enabled us to successfully condense and disguise our dataset was our knowledge of the learning method. Our algorithm did not require the sample variance; therefore we did not sacrifice performance by altering it. One could imagine alternate approaches that make use of the sample variance, such that obscuring it might damage performance on the task of mean estimation. While we cannot cover this in more detail here, as in other areas of privacy, privacy-preserving core-sets require striking a balance between accuracy and privacy.

4 A synthetic case study using linear regression

Suppose we have a dataset X, target values y which we would like to predict, and secret targets s which we would like to hide. We propose to build a loss that encourages fidelity in reconstructing y while obscuring s. Specifically, we use the following scheme for core-set construction, following our framework. First, train on the entire dataset, and record the weights w and the intercept b. Then, define a loss for a point (x_i, y_i, s_i) as ℓ_i = (w·x_i + b − y_i)² + α·h(x_i, s_i), where α > 0 trades off fidelity and privacy. The first term of the loss favors points that align with the learned model for y from the full data. The second term, h, not yet specified, should make it hard to recover s. To construct a k-core-set, simply choose the k points with lowest loss.

For the privacy term, many approaches are possible, but we consider two simple ones:

  • Hide the secret value. In this approach, we penalize a point's deviation from a constant secret value, e.g. h(x_i, s_i) = (s_i − s̄)², where s̄ is the mean secret value. This is a simple way to make s difficult to predict, by choosing samples whose secret values are all roughly the same. However, this method might select a core-set that is artificially uniform in value, perhaps selecting almost all points from a majority class.

  • Plant a synthetic secret value. In this approach, we take h(x_i, s_i) = (w′·x_i + b′ − s_i)² for some random vector w′ and intercept b′. As the core-set is a subset of data points with their labels, we cannot modify the labels of points. Instead, we favor selection of points whose noise in the direction of w′ will mislead a learner. This will encourage simple learners to discover the planted relationship rather than the original one.
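A minimal sketch of this scheme (the helper names and exact loss form are our reconstruction, not verbatim from the paper): fit a linear model for y on the full data, score each point by its squared residual plus a weighted privacy term, and keep the k lowest-loss points. Both privacy terms described above are shown.

```python
import numpy as np

def select_coreset(X, y, s, k, h, alpha=1.0):
    """Return indices of the k points with lowest fidelity + alpha * privacy loss."""
    # Fit the full-data linear model (weights + intercept) for the target y.
    A = np.hstack([X, np.ones((len(X), 1))])    # append intercept column
    wb = np.linalg.lstsq(A, y, rcond=None)[0]
    fidelity = (A @ wb - y) ** 2                # agreement with full-data model
    loss = fidelity + alpha * h(X, s)
    return np.argsort(loss)[:k]

# Privacy term 1: "hide" -- prefer points whose secret value is near its mean,
# so the released secret values carry little signal.
def h_hide(X, s):
    return (s - s.mean()) ** 2

# Privacy term 2: "plant" -- prefer points whose secret value happens to agree
# with a random linear rule, nudging a learner toward the planted relationship.
def make_h_plant(rng, d):
    w_prime = rng.normal(size=d)
    b_prime = rng.normal()
    return lambda X, s: (X @ w_prime + b_prime - s) ** 2

# Synthetic data in the style of the paper's case study.
rng = np.random.default_rng(0)
n, d, k = 2000, 2, 100
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)
s = X @ np.array([0.5, 0.5]) + rng.normal(scale=0.5, size=n)

I = select_coreset(X, y, s, k, make_h_plant(rng, d))
print(len(I))
```

Swapping `h_hide` for the planted term changes only the privacy criterion; the fidelity term and the bottom-k selection are shared by both variants.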

In this short paper, we don't argue that these two simple approaches are safe against targeted attacks; in fact, we expect they are not. However, we can analyze their effect on standard training algorithms. Let X be a synthetic dataset with entries drawn i.i.d. from a standard normal, and let the labels y and s each be noisy linear functions of X, where each noise term is i.i.d. Gaussian. The task is to recover the line of best fit. As seen in Table 1, for a full dataset of size n reduced to a k-core-set, our method works well, preserving prediction of y while masking the true relationship between X and s.

[Table 1: numeric entries were lost in extraction. For each task, the rows reported the true coefficients, the full-dataset estimates, and the core-set estimates.]
Table 1: Linear regression performance for each task, training on both the full dataset and the core-set. The key result is that the full dataset is predictive for both problems, but the core-set is only predictive for the primary target.
Figure 1: Visualization of the datasets (green/red) and core-sets (blue) for each feature and label from Table 1.

5 Core-sets for CIFAR-100

The synthetic example above shows that core-sets can hide properties of the original dataset, but the example is stylized, and the primary and secret values are chosen independently. We now consider a real dataset and attempt to obscure one set of labels that is highly correlated with another.

CIFAR-100 comes labeled both with coarse and fine-grained labels. So, we will try to select a subset of data such that performance on identifying coarse-grained classes is maximized but performance on fine-grained classification is minimized. Additionally, to ensure that classes are not simply dropped from the dataset, we also constrain the dataset to be class-balanced under the fine-grained labels.

Again following our framework, we train a model on the full dataset. We collect for each example the average gradient magnitude over training as a proxy for its difficulty, the idea being that difficult, unrepresentative, or uncommon examples will yield lower performance at small core-set sizes than easy examples will. (Evidence for this method is presented in Figure 2.) We then choose examples that produce good core-sets for coarse-grained classification but generalize poorly on the individual fine-grained classes.

Figure 2: CNN CIFAR-100 test accuracy trained on class-balanced core-sets versus the average gradient norm of the examples used. Core-sets made of smaller-normed examples do better.

We trained a wide-resnet-16-8 on CIFAR-100 and calculated the average gradient norm of each example on each task. We then constructed several core-sets for evaluation. The fine-grained-masking core-set is constructed by sorting and class-balancing data by the quotient of an example's average fine-grained gradient norm and its coarse-grained norm, and the coarse-grained-masking core-set by the inverse of that quotient. For completeness, we also report results using a random core-set and, for each task, a high-quality core-set built from the minimum-gradient-norm examples. Final accuracies were obtained using a small CNN due to compute limitations.
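The selection step can be sketched independently of the training loop (our own illustration; it assumes per-example gradient norms have already been recorded, and uses synthetic stand-ins for them rather than real CIFAR-100 statistics):

```python
import numpy as np

def masking_coreset(fine_norms, coarse_norms, fine_labels, per_class):
    """Class-balanced core-set that favors the coarse task over the fine task.

    Sort examples by the ratio (fine gradient norm / coarse gradient norm)
    and, within each fine-grained class, keep the `per_class` examples with
    the largest ratio: relatively hard for the fine-grained task, relatively
    easy for the coarse-grained one. Using the inverse ratio would mask the
    coarse task instead.
    """
    ratio = fine_norms / coarse_norms
    selected = []
    for c in np.unique(fine_labels):
        members = np.where(fine_labels == c)[0]
        order = members[np.argsort(-ratio[members])]   # descending ratio
        selected.extend(order[:per_class].tolist())
    return np.array(selected)

# Synthetic stand-ins for the recorded per-example statistics.
rng = np.random.default_rng(1)
n, n_classes = 5000, 100
fine = rng.uniform(0.1, 1.0, size=n)
coarse = rng.uniform(0.1, 1.0, size=n)
labels = rng.integers(0, n_classes, size=n)

I = masking_coreset(fine, coarse, labels, per_class=5)
print(len(I))
```

The per-class loop enforces the class-balance constraint from the section above, so no fine-grained class is simply dropped from the released data.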

As seen in Table 2, we found that we were successful in constructing core-sets which would maximize performance for one task and minimize for the other.

Core-set                  Coarse-grained classification (%)   Fine-grained classification (%)
Random                    36.35                               22.83
Fine-grained min-norm     42.86                               30.79
Coarse-grained min-norm   43.10                               27.15
Fine-grained-masking      39.55                               16.40
Coarse-grained-masking    33.53                               26.32

Table 2: Performance of the CNN classifier on various core-sets. The bottom two rows show that, even on these closely related tasks, our method can construct core-sets which significantly favor one task over the other.

6 Discussion

The main implication of this work is that one need not alter or synthesize data nor come up with alternate embeddings for data as in Zemel et al. (2013); Wang et al. (2018). Rather, by carefully selecting an appropriate subset of data for release, one can sufficiently alter the statistics of relevant properties so as to obscure protected values. In this extended abstract, we introduce these ideas without discussing specifics of attacks and counter-attacks, as the start of a conversation about core-set-based privacy.

In the linear regression domain, our method successfully hides protected functions from direct likelihood-maximizing approaches on synthetic data. We would like to point out some nice additional properties of our formulation. First, as the dimensionality of the input data increases, so does the probability that the random planted direction is approximately orthogonal to the true relationship. Second, the trade-off parameter in the loss lets one easily tune the balance between performance and privacy. Third, the factor by which the dataset must be reduced to construct a core-set depends on the magnitude of the unexplained variance: for very low unexplained variance, one needs many samples just to find a few unrepresentative examples. In more difficult, real-world problems, though, this should actually be easier. One potential vulnerability is that if an adversary has good priors on the distribution of input features, they may be able to gain information from the difference in feature distribution in the core-set. In the future, we would like to address these problems and experiment with regressions on real-world data.
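The first property is easy to check numerically (our own illustration): the cosine similarity between a random direction and any fixed direction concentrates near zero as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, trials=2000):
    # Average |cos angle| between random Gaussian directions and the
    # fixed axis e_1 in d dimensions.
    v = rng.normal(size=(trials, d))
    cos = v[:, 0] / np.linalg.norm(v, axis=1)
    return np.abs(cos).mean()

for d in (2, 10, 100, 1000):
    print(d, mean_abs_cosine(d))
```

As d grows, the average |cosine| shrinks toward zero (roughly like 1/sqrt(d)), so a randomly planted direction is increasingly unlikely to interfere with the true relationship.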

Additionally, it should be noted that the CIFAR-100 task we chose is actually an especially difficult problem for this approach. First, the dataset is only several times larger than the amount of data required to train a CNN from scratch reasonably consistently, which limits the choice of unrepresentative samples. Second, the tasks of predicting class versus super-class are very similar, which makes it surprising that it is even possible to find core-sets which are significantly more predictive on one task than the other. It will be interesting to try these ideas on other datasets such as ImageNet.

7 Conclusions

In this work we have presented simple and practical methods for constructing fair core-sets which better preserve privacy. We demonstrate their utility on both synthetic and real-world datasets, with both linear and neural models. We hope that this work will pave the way for smaller and more private datasets in the future.


We’d like to thank Tushar Chandra for productive conversation related to this work, and Asher Spector for advice, comments, and revisions on this paper.