Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks, which infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.
Machine learning algorithms are often trained on sensitive or private user information, such as medical records [24, 49], textual conversations [12, 11], or financial information [35, 42]. These trained models can inadvertently leak information about their training data [44, 34, 5]—thereby violating users’ privacy.
In perhaps the simplest form of information leakage, membership inference [44] attacks allow an adversary to determine whether or not a data point was used in the training data. Revealing just this information can cause harm, as it leaks information pertaining specifically to members of the model’s training data, rather than about the user population as a whole. For example, suppose a model is trained to learn the link between a cancer patient’s morphological data and their reaction to some drug. An adversary in possession of a victim’s morphological data and with query access to the trained model cannot directly infer whether the victim has cancer. However, inferring that the victim’s data was part of the model’s training set reveals that the victim indeed has cancer.
Existing membership inference attacks exploit the higher prediction confidence that models exhibit on their training data [38, 58, 54, 20, 41, 44]. This difference in prediction confidence is largely attributed to overfitting [44, 58]. In these attacks, the adversary queries the model on a target data point to obtain the model’s confidence and infers the target’s membership in the training set based on a decision rule.
A large body of work has been devoted to understanding and mitigating membership inference leakage in ML models. Existing defense strategies fall into two broad categories:
Defenses in the first category either use regularization techniques (e.g., dropout, weight decay, or early stopping) developed by the ML community to reduce overfitting, or simply increase the amount of training data [54, 41, 44]. In contrast, the second category of adversary-aware defenses explicitly aims to minimize the membership inference leakage as computed by a particular attack [23, 33, 57]. Existing defenses in this category alter a model's outputs, either through a modification of the training procedure (e.g., the addition of a loss penalty) or of the inference procedure post hoc to training (e.g., to flatten returned confidence scores on members).
In this paper, we introduce label-only membership inference attacks. Our threat model makes fewer assumptions compared to prior attacks, in that the adversary can only obtain (hard) labels when querying the trained model, without any prediction confidences. This threat model is more realistic in practice, as many machine learning models deployed in user-facing products are unlikely to expose raw confidence scores.
In the label-only setting, a naive baseline strategy predicts that a target point is a member of the training set when the model's prediction is correct. We show that even this simple attack matches the best confidence-based attacks in some settings. In order to design label-only attacks that perform better than this baseline, we will necessarily have to make multiple queries to the target model. We show how to extract fine-grained membership leakage by analyzing a model's robustness to perturbations of the target data, which reveals signatures of the model's decision-boundary geometry. Our adversary queries the model for predictions on augmentations of data points (e.g., rotations and translations in the vision domain) as well as on adversarial examples.
In an extensive evaluation we show that our attacks match the performance of confidence-based attacks (see Section VI-A). We further show that our attacks naturally break existing defenses that fall into category (2) discussed above. These defenses either implicitly or explicitly rely on a strategy that we call confidence masking.^1 This strategy consists of masking the membership inference leakage signal contained in the model's confidence scores, thereby thwarting existing attacks. However, the (hard) labels predicted by the model remain largely unaffected, which explains why such defenses have little to no effect against our label-only attacks. Put differently, confidence masking does not address the inherent privacy leakage that stems from the model being overfit on the training data. This allows us to break two state-of-the-art defenses: MemGuard [23] and adversarial regularization [33]. While these defenses can successfully reduce the accuracy of existing (confidence-based) membership inference attacks to near random chance, they have a negligible effect on the success rate of our attacks.

^1 Similar to gradient masking from the adversarial examples literature [36].
Overall, our evaluation demonstrates that the use of confidence values in membership inference attacks is unnecessary. Existing attacks either do not outperform the naive baseline, or when they do, their performance can be matched by attacks that only rely on the model’s predicted label.
Finally, we argue that successful membership inference defenses should protect not only the privacy of the average user, but also that of the worst-case outlier user. We find that for some models with low average-case membership leakage, the membership of users in the tails of the distribution can still be inferred with high precision, even with label-only attacks. Models trained with differential-privacy guarantees [1, 7, 56, 3] appear to effectively minimize the amount of membership leakage for all users, even when the formal privacy bounds are close to meaningless (i.e., for very large values of the privacy budget).
We make the following contributions:
We introduce the first label-only attacks, leveraging data augmentations and adversarial examples.
We show that confidence masking is not a viable defense to privacy leakage, by breaking two canonical defenses that use it—MemGuard and Adversarial Regularization—with our attacks.
We evaluate two additional techniques for reducing overfitting and find that training with data augmentations can worsen membership inference leakage, while transfer learning can mitigate this leakage.
We introduce “outlier membership inference”: a stronger property that defenses should satisfy; at present, differentially private training and (strong) L2 regularization appear to be the only effective defenses.
We will release code to reproduce all our experiments.
We consider supervised classification tasks [32, 43], wherein a model is trained to predict a class label y given input data x. Commonly, x may be an image or a sentence, and y would be the corresponding label, for instance, a digit 0-9, an object type, or a text sentiment.
We focus our study on neural networks [2]: functions composed of a series of linear-transformation layers, each followed by a non-linear activation. The overall layer structure is called the model's architecture, and the learnable parameters of the linear transformations are the weights. For a classification problem with k classes, the last layer of a neural network outputs a vector v of k values (often called logits). The softmax function is typically used to convert the logits into normalized confidence scores.^2 For a model h, we define the model's output h(x) as the vector of softmax values. The model's predicted label is the class with the highest confidence, i.e., argmax_i h(x)_i.

^2 While it is common to refer to the output of a softmax as a "probability vector" because its components are in the range [0, 1], we refrain from using this terminology given that the scores output by a softmax cannot be rigorously interpreted as probabilities [18].

Augmenting data aims to improve the generalization of a classifier [9, 46, 52]. Data augmentations are commonly used on state-of-the-art models [21, 9, 37] to create new and larger datasets to learn from, without the need to acquire more labeled data samples (a costly process). Augmentations are especially important in low-data regimes [40, 14, 10]. Augmentations are domain-specific: they apply to a certain type of input (e.g., images or text).

We focus on image classifiers, where the main types of augmentations are affine transformations (rotations, reflections, scaling, and shifts), contrast adjustments, cutout [13], and blurring (adding noise). By synthesizing new data samples as augmentations of existing ones, the model can learn a more semantically meaningful set of features. Data augmentations can potentially teach the machine learning model to become invariant to the augmentation (e.g., rotationally or translationally invariant).
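A minimal sketch of dataset augmentation; the flips used here stand in for the affine transformations mentioned above:

```python
def augment_dataset(samples, augmentations):
    # Enlarge a labeled dataset with transformed copies; each copy keeps
    # the original label, so no new labeling effort is needed.
    out = []
    for x, y in samples:
        out.append((x, y))
        out.extend((aug(x), y) for aug in augmentations)
    return out

hflip = lambda img: [row[::-1] for row in img]   # horizontal reflection
vflip = lambda img: img[::-1]                    # vertical reflection
data = [([[1, 2], [3, 4]], 0)]
augmented = augment_dataset(data, [hflip, vflip])
assert len(augmented) == 3
assert augmented[1] == ([[2, 1], [4, 3]], 0)     # flipped copy, same label
```

A model trained on `augmented` sees each labeled example under several transformations, which is what can make it (approximately) invariant to those transformations.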
Transfer learning is a common technique used to improve generalization in low-data regimes [51]. By leveraging data from a source task, it is possible to transfer knowledge to a target task. A common approach for transfer learning is to train a model on the data of the source task, and then fine-tune this model on data from the target task. In the case of neural networks, it is common to fine-tune either the entire model, or just the last layer.
Membership inference attacks [44] are a form of privacy leakage in which an adversary determines whether a given data sample was part of a machine learning model's training dataset.
Given an example (x, y) and access to a trained model h, the adversary uses a classifier or decision rule A to compute a membership prediction A(x; h), with the goal that the prediction is "member" whenever x is a training point. The main challenge in mounting a membership inference attack is creating the classifier A, under various assumptions about the adversary's knowledge of h and of its training data distribution.
Prior work assumes that an adversary has only black-box access to the trained model h, via a query interface that, on input x, returns part or all of the confidence vector h(x).
The original membership inference attack of Shokri et al. [44] creates a membership classifier by first training a number of local "shadow" models (we will also refer to these as source models). Assuming that the adversary has access to data from the same (or a similar) distribution as h's training data, the adversary first locally trains a number of auxiliary classifiers on subsets of this data. Since these shadow models are trained by the adversary, their training sets, and by extension the membership of any data point in these training sets, are known. The adversary can thus construct a dataset of confidence vectors with associated membership labels. Finally, the adversary trains a classifier to predict membership from a confidence vector. To apply the attack, the adversary queries the targeted model to obtain the confidence vector h(x), and then uses its trained classifier to predict the membership of x in h's training data.
Salem et al. [41] later showed this attack strategy could succeed even without access to data from the same distribution as the target model's training data, using only data from a similar task (e.g., a different vision task). They also demonstrated that training shadow models is unnecessary: applying a simple threshold on the targeted model's confidence scores suffices. That is, the adversary predicts that x is in the training data if the prediction confidence is above a tuned threshold.
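The thresholding strategy can be sketched as follows; the confidence values below are simulated stand-ins (members are drawn with systematically higher top confidence, an assumption of this toy setup), since a real attack would query a trained model:

```python
import random
random.seed(0)

# Simulated top-confidence scores: an overfit model tends to be more
# confident on its training members than on unseen points.
members = [min(1.0, 0.95 + random.gauss(0, 0.03)) for _ in range(500)]
nonmembers = [max(0.0, 0.75 + random.gauss(0, 0.03)) for _ in range(500)]

def attack_accuracy(threshold):
    # Predict "member" iff the model's top confidence exceeds the threshold.
    hits = sum(c >= threshold for c in members)
    rejections = sum(c < threshold for c in nonmembers)
    return (hits + rejections) / (len(members) + len(nonmembers))

# Tune the threshold (in practice: on local shadow data, not the target).
best = max((t / 100 for t in range(101)), key=attack_accuracy)
assert attack_accuracy(best) > 0.9
```

The point of the sketch is that no shadow-model classifier is needed: a single tuned scalar threshold on confidence already separates members from non-members when the model is overfit.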
Yeom et al. [58] propose a simple baseline attack: the adversary predicts a data point as being a member of the training set when the model h classifies it correctly. The accuracy of this baseline attack directly reflects the gap between the model's train and test accuracy: if h overfits on its training data and obtains much higher accuracy on it, this baseline attack will achieve non-trivial membership inference. We call this the gap attack. If the adversary's target points are equally likely to be members or non-members of the training set (for more on this, see Section V-A), this attack achieves an accuracy of

(1)    1/2 + (acc_train - acc_test) / 2

where acc_train and acc_test are the target model's accuracy on training data and held-out data, respectively.
To the best of our knowledge, this is the only attack proposed in prior work that makes use of only the model's predicted label. Our goal is to investigate how this simple baseline can be surpassed to achieve label-only membership inference attacks that perform on par with attacks that use access to the model's confidence scores.
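The baseline accuracy in Equation (1) can be computed directly:

```python
def gap_attack_accuracy(train_acc, test_acc):
    # Balanced membership accuracy of the baseline "gap attack":
    # predict "member" iff the model classifies the point correctly.
    return 0.5 + (train_acc - test_acc) / 2.0

# A model that fits its training set far better than held-out data
# leaks membership even through hard labels alone.
assert gap_attack_accuracy(1.00, 0.50) == 0.75
assert gap_attack_accuracy(0.90, 0.90) == 0.50   # no train-test gap: chance level
```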
The work of Long et al. [29] investigates the possibility of membership inference through indirect access, wherein the adversary can only query the model on inputs that are related to the target point x, but not on x directly. The label-only attacks we present in this paper similarly make use of information gleaned from querying the model on data points related to x (specifically, various perturbed versions of x).
The main difference is that we focus on label-only attacks, whereas the work of Long et al. [29] assumes adversarial access to the model's confidence scores. Our attacks will also be allowed to query the model at the chosen point x, but again only to obtain the model's predicted label.
Defenses against membership inference broadly fall into two categories.
First, Shokri et al. [44] demonstrated that overfitting plays a role in their attack’s success rate. Thus, standard regularization techniques such as L2 weight normalization [44, 23, 54, 33], dropout [23], or differential privacy have been proposed to defend against membership inference. Heavy regularization has been shown to limit overfitting and to effectively defend against membership inference, but may result in a significant degradation in the model’s accuracy. Moreover, Yeom et al. [58] show that overfitting is sufficient, but not necessary, for membership inference to be possible.
Second, a variety of techniques have been suggested for reducing the information contained in a model's confidence scores. These include truncating confidence scores to a lower precision [44], reducing the dimensionality of the confidence score vector [44, 54] (e.g., only returning the top-k scores), or perturbing confidences via an adversary-aware "minimax" approach [33, 57, 23]. These latter defenses modify either the model's training or inference procedure so that the model produces minimally perturbed confidence vectors that thwart existing membership inference attacks. We refer to defenses in this second category as "confidence-masking" defenses.
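A toy illustration of why confidence masking cannot remove label-only leakage: masking transforms such as top-k truncation or rounding alter the scores but, by design, preserve the argmax, and hence the hard label. The transforms below are simplified stand-ins for the defenses cited above:

```python
def predicted_label(scores):
    # The hard label is the index of the highest confidence score.
    return max(range(len(scores)), key=scores.__getitem__)

def mask_topk(scores, k=1):
    # Return only the top-k scores (others zeroed), as in
    # dimensionality-reduction defenses.
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [s if i in top else 0.0 for i, s in enumerate(scores)]

def mask_round(scores, digits=1):
    # Truncate confidences to low precision.
    return [round(s, digits) for s in scores]

scores = [0.71, 0.05, 0.24]
for masked in (mask_topk(scores), mask_round(scores)):
    # The masked scores reveal less fine-grained information, but the
    # hard label, and hence any label-only signal, is unchanged.
    assert predicted_label(masked) == predicted_label(scores)
```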
Most membership inference research focuses on protecting the average-case user's privacy: the success of a membership inference attack is evaluated over a large dataset. Long et al. [29] focus on understanding the vulnerability of outliers to membership inference. They show that some outlier data points can be targeted and have their membership inferred with high precision (up to 90%) [28, 29]. Recent work analyzes membership inference from the defender's perspective, that is, in a white-box setting with complete access to the model, to understand how overfitting impacts membership leakage [27].
Query Interface     | Attack Feature  | Knowledge          | Source
--------------------|-----------------|--------------------|-------
confidence vector   | h(x)            | train, data, label | [44]
confidence vector   | h(x)            | train, data        | [28]
confidence vector   | h(x)            | -                  | [41]
confidence vector   | h(x)            | -                  | [41]
confidence vector   | l(h(x), y)      | -                  | [58]
label-only          | predicted label | label              | [58]
label-only          | aug(x)          | train, data, label | ours
label-only          | dist(x, y)      | train, data, label | ours

Table I: Taxonomy of membership inference attacks. l is the model's loss function, aug(x) is a data augmentation of x (e.g., an image translation), and dist(x, y) is the distance from x to the decision boundary. Train, data, and label knowledge mean, respectively, that the adversary (1) knows the model's architecture and training algorithm, (2) has access to samples from the training distribution, and (3) knows the true label of the examples being queried.

The goal of a membership inference attack is to determine whether or not a candidate data point was used to train a given model. In Table I, we summarize different sets of assumptions made in prior work about the adversary's knowledge and query access to the model.
The membership inference threat model originally introduced by Shokri et al. [44], and used in many subsequent works, assumes that the adversary has black-box access to the model (i.e., the adversary cannot inspect the model's learned parameters and can only interact with it via a query interface that returns the model's prediction and confidence). Our work also assumes black-box model access, with the extra restriction, discussed in more detail in Section III-B, that the model only returns (hard) labels to queries. We note that studying membership inference attacks with white-box model access [27] has merits (e.g., for upper-bounding the membership leakage), but our label-only restriction inherently presumes a setting where the adversary has black-box model access only (as otherwise, the adversary could simply run the model locally to obtain confidence scores).
Assuming a black-box query interface, there are a number of other dimensions to the adversary's assumed knowledge of the trained model:
Task knowledge refers to global information about the model's prediction task and, therefore, about its prediction API. Examples of task knowledge include the total number of classes, the class labels (dog, cat, etc.), and the input format (e.g., RGB or grayscale images). Task knowledge is always assumed to be known to the adversary, as it is necessary for the classifier service to be useful to a user.
Training knowledge refers to any knowledge about the model architecture (e.g., the type of neural network, its number of layers, etc.) and how it was trained (the training algorithm, size of the training dataset, number of training steps, etc.). Some of this information could be publicly available, or inferable from a model extraction attack [53, 55].
Data knowledge constitutes any knowledge about the data that was used to train the target model. Of course, full knowledge of the training data renders membership inference trivial. Partial knowledge may consist of having access to (or the ability to generate) samples from the same data distribution, or from a related distribution.
Label knowledge refers to knowledge of the true label of each point for which the adversary is predicting membership. Whether knowledge of a data point implies knowledge of its true label depends on the application scenario. Salem et al. [41] show that attacks that rely on knowledge of query labels can often be matched by attacks that do not.
Our paper studies a different query interface than most prior membership inference work. The choice of query interface ultimately depends on the application needs where the target model is deployed. We define two types of query interfaces, with different levels of response granularity:
Confidence vector: on a query x, the adversary receives the full vector of confidence scores h(x) from the classifier. In a multi-class scenario, each value in this vector corresponds to an estimated confidence that the class is the correct label. Prior work has shown that restricting access to only part of the confidence vector has little effect on the adversary's success [44, 54, 41].
Label-only: in this setting, the adversary only obtains the model's predicted label, without any confidence scores. This is the minimal piece of information that any queryable machine learning model must provide, and is thus the most restrictive query interface from the adversary's perspective. Such a query interface is also highly realistic, as the adversary may only get indirect access to a deployed model in many settings. For example, the model may be part of a larger system that takes actions based on the model's predictions. Here, the adversary can only observe the system's actions but not the internal model's confidence scores.
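The two interfaces can be contrasted in a small sketch; the model here is a stand-in that returns fixed confidence scores:

```python
def confidence_vector_api(model, x):
    # Full-granularity interface: returns the vector of confidence scores.
    return model(x)

def label_only_api(model, x):
    # Minimal interface: returns only the predicted (hard) label.
    scores = model(x)
    return max(range(len(scores)), key=scores.__getitem__)

model = lambda x: [0.2, 0.7, 0.1]     # stand-in classifier
assert confidence_vector_api(model, None) == [0.2, 0.7, 0.1]
assert label_only_api(model, None) == 1
```

Every bit of information available through `label_only_api` is also available through `confidence_vector_api`, which is why label-only attacks operate under a strictly weaker adversary.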
As our main goal is to show that label-only attacks can match the success of prior attacks, we consider a simple threat model matching that typically considered in prior work, except that we assume a label-only query interface.
We assume that the adversary has: (1) full knowledge of the task; (2) knowledge of the target model's architecture and training setup; (3) partial data knowledge, i.e., access to a disjoint partition of data samples from the same distribution as the target model's training data; and (4) knowledge of the targeted points' true labels.
We note that prior work has proposed various techniques to build strong membership inference attacks under relaxed adversarial-knowledge assumptions, specifically of reduced data and model architecture knowledge [58, 41]. To simplify our exposition and to center our analysis on comparing the confidence-vector and label-only settings, we leave a fine-grained analysis of label-only attacks under different levels of adversarial knowledge to future work.
We propose new membership inference attacks that improve on existing attacks in two ways:
Our attacks extract fine-grained information about the classifier's decision boundary by combining multiple queries on strategically perturbed samples.
Our attacks are label-only, i.e., they do not rely on the model returning confidence scores.
Therefore, our attacks pose a threat to any machine learning service that can be queried, regardless of any additional output information it might provide beyond the predicted label. Moreover, we show that our label-only attacks can break multiple state-of-the-art defenses that implicitly or explicitly rely on "confidence masking" (see Section VII).
Label-only attacks face a challenge of granularity in determining the membership of a data point. For any query x, the attack's information is limited to only the predicted class label. A simple baseline attack [58], which predicts any misclassified data point as a non-member of the training set, is a useful benchmark to assess the extra information that different attacks (whether label-only or with access to confidence vectors) can extract. We call this baseline the gap attack because its accuracy is directly related to the gap between the model's accuracy on training data and held-out data (see Equation (1)). To glean additional bits of information on top of this baseline attack, any adversary operating in the label-only regime must necessarily make additional queries to the model.
At a high level, our strategy is to compute label-only "proxies" for the model's confidence in a particular prediction, by strategically querying the model on various perturbed versions of x. Specifically, we evaluate the target model's robustness to different input perturbations, either synthetic (i.e., standard data augmentations) or adversarial (i.e., adversarial examples [50]), and predict that data points exhibiting high robustness are training points.
The intuition for leveraging robustness to data augmentations stems from the fact that many models use data augmentation at training time. Thus, if some data point x was used to train the model, then so were augmented versions of x. By querying the model on these augmented versions of a target point, we aim to obtain a more precise membership signal. In some sense, this can be seen as exploiting the model's "effective" train-test gap on an augmented dataset.
Even for models that were not trained with data augmentations, studying a model's robustness to perturbations can serve as a proxy for model confidence, as we now evidence for the special case of (binary) logistic regression models. Given a learned weight vector w and bias b, a logistic regression model outputs a confidence score for the positive class of the form h(x) = sigma(w . x + b), where sigma(t) = 1 / (1 + e^(-t)) is the logistic function.
For such a linear model, there is a monotone relationship between the model's confidence at a point x and the Euclidean distance from x to the model's decision boundary. Specifically, the distance from x to the model's boundary is (w . x + b) / ||w||_2. Thus, for linear models, obtaining a point's distance to the boundary yields the same information as the model's confidence score. As it turns out, computing the distance from a point to the boundary is exactly the problem of finding the smallest adversarial perturbation, which can be done using label-only access to a classifier [4, 8]. Our thesis is that for deep, non-linear models, the relationship between a model's confidence scores and the distance to its boundary will persist.^3 This thesis is supported by prior work suggesting that deep neural networks can be closely approximated by linear functions in the vicinity of the data [19, 39].

^3 Song et al. [47] also make use of adversarial examples to infer membership. Their approach crucially differs from ours in two aspects: (1) they assume access to confidence scores, and (2) they target models that were explicitly trained to be robust to adversarial examples. In this sense, their approach bears some similarities with our attacks on models trained with data augmentation (see Section VIII), where we also find that a model's invariance to some perturbations can leak additional membership signal.
Our data augmentation attacks proceed as follows. Given a target data point (x, y), we first create N additional data points via different data augmentation strategies, described in more detail below. We then query the target model at all these points (including the original point) to obtain labels y_0, y_1, ..., y_N. Let b_i be the indicator for whether the i-th queried point was misclassified. Finally, we apply a prediction model g(b_0, ..., b_N) to decide whether x should be classified as a training-set member or not.
To tune the membership classifier g, the adversary first locally trains a source (or "shadow") model h', assuming knowledge of the target model's architecture and of the distribution of its training data. As the adversary knows the training data for h', it can train g to maximize membership inference accuracy against this local model. The adversary then "transfers" g to predict membership of target points using the query responses of the black-box model h.
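A minimal sketch of this procedure; the model, augmentations, and decision rule here are toy stand-ins (in the attack, the rule is tuned on the local shadow model):

```python
def augmentation_attack(model, x, y, augmentations, membership_rule):
    # Query the label-only interface on the point and its augmented versions.
    queries = [x] + [aug(x) for aug in augmentations]
    labels = [model(q) for q in queries]
    # b_i = 1 if the i-th query was misclassified, else 0.
    b = [int(lbl != y) for lbl in labels]
    # The membership rule maps the misclassification pattern
    # to a member / non-member prediction.
    return membership_rule(b)

# Toy instance: a "model" that is robust around its training point (0, 0).
model = lambda x: 0 if abs(x[0]) + abs(x[1]) < 2 else 1
shifts = [lambda x: (x[0] + 1, x[1]), lambda x: (x[0] - 1, x[1])]
rule = lambda b: sum(b) == 0   # member iff every query is classified correctly
assert augmentation_attack(model, (0, 0), 0, shifts, rule) is True
```

Points near the decision boundary lose some augmented queries to misclassification, so the same rule rejects them as non-members.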
We experiment with two common data augmentations in the computer vision domain: image rotations and translations.
Our rotation augmentation rotates images by small angles around the original orientation. Specifically, given a rotation magnitude r, we generate a small set of images, including the source, by rotating the source image by angles up to ±r.
Our translation attack follows a similar procedure. Given a pixel bound d, we translate the image by i pixels horizontally and by j pixels vertically, for pairs (i, j) satisfying the bound. This yields a set of translated images in addition to the original untranslated image.
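One plausible way to enumerate the translation queries is sketched below; the cyclic (wrap-around) shift and the |i| + |j| <= d enumeration are assumptions of this sketch, not necessarily the paper's exact scheme:

```python
def translation_queries(image, d):
    # Enumerate all shifts (i, j) with |i| + |j| <= d, including (0, 0).
    def shift(img, i, j):
        # Cyclic pixel shift of a 2-D image (list of rows).
        h, w = len(img), len(img[0])
        return [[img[(r - j) % h][(c - i) % w] for c in range(w)]
                for r in range(h)]
    return [shift(image, i, j)
            for i in range(-d, d + 1)
            for j in range(-d, d + 1)
            if abs(i) + abs(j) <= d]

img = [[1, 2], [3, 4]]
queries = translation_queries(img, 1)
# d = 1 yields the original image plus 4 single-pixel shifts.
assert len(queries) == 5
assert img in queries
```

The query budget thus grows with d, which is the trade-off explored in the next paragraph.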
In Section VI-B, we explore the effect of picking different query budgets (i.e., the values r and d for rotation and translation augmentations) on the attack strength.
The attacks described in this section aim to predict membership based on a point’s distance to the model’s decision boundary. As we have seen above, for linear models the distance to the boundary captures the same information as the model’s confidence score. The attacks below extend this intuition to deeper neural networks.
Given some estimate of a point’s distance to the model’s boundary, we predict that is a training set member if for some threshold . We define for misclassified points, where . To tune the threshold , we train a local source model , and set so as to maximize membership inference accuracy on .
Our first procedure for estimating dist(x, y) is an idealized attack that assumes white-box access to the model, and is therefore not label-only. To estimate a point's distance to the boundary, we use adversarial examples generated by the Carlini and Wagner attack [6]: given (x, y), the attack tries to find the closest point to x in the Euclidean norm on which the model's prediction differs from y.
To make the attack label-only, we rely on label-only attacks developed in the adversarial examples literature [4, 8]. Given a point x, these attacks begin by picking a random point that the model classifies differently from x, and then issue multiple label-only queries to find the model's decision boundary. They then "walk" along the model's boundary while minimizing the distance to x. In our experiments, we use the recent "HopSkipJump" attack [8], which has been shown to closely approximate the distance estimates produced by stronger white-box attacks (e.g., Carlini-Wagner), given a few thousand label-only queries.
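A heavily simplified sketch of the first step of such an attack: locating the boundary along a segment with hard-label queries via binary search (the full HopSkipJump attack additionally walks along the boundary to tighten the estimate; the classifier here is a toy stand-in):

```python
def boundary_distance(query_label, x, x_adv, y, steps=50):
    # Binary-search along the segment from x (label y) to x_adv (any point
    # with a different label) for the decision boundary, using only
    # hard-label queries. The result is a crude upper bound on the true
    # distance to the boundary.
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        point = [xi + mid * (ai - xi) for xi, ai in zip(x, x_adv)]
        if query_label(point) == y:
            lo = mid          # still on x's side of the boundary
        else:
            hi = mid
    crossing = [xi + hi * (ai - xi) for xi, ai in zip(x, x_adv)]
    return sum((ci - xi) ** 2 for ci, xi in zip(crossing, x)) ** 0.5

# Toy classifier: label 0 iff the first coordinate is below 3.
label = lambda p: 0 if p[0] < 3.0 else 1
d = boundary_distance(label, [0.0, 0.0], [10.0, 0.0], 0)
assert abs(d - 3.0) < 1e-6
```

Each binary-search step costs one label-only query, which is why these attacks need query budgets in the thousands to match white-box distance estimates.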
As label-only adversarial-example attacks such as HopSkipJump require a large number of queries, we also explore a much simpler approach based on random perturbations. Again, our intuition stems from linear models: a point's distance to the boundary is directly related to the model's accuracy when the point is perturbed by isotropic Gaussian noise [17]. The attack we propose presumes that this relationship also holds for deeper models. We compute a crude proxy for dist(x, y) by evaluating the accuracy of the model on points of the form x + e, where e is sampled from N(0, sigma^2 I). We tune the standard deviation sigma, as well as the membership threshold tau, on the adversary's local source model h'.

Our evaluation is aimed at understanding how label-only membership inference attacks compare with prior attacks that rely on access to a richer query interface. To this end, we aim to answer the following questions:
Can label-only membership inference attacks match (or even outperform) prior attacks that make use of the model's (full) confidence vector?
Under what settings do different label-only attacks perform best?
Are there settings in which label-only attacks can improve upon prior attacks?
What defenses are successful against all attacks, whether label-only or with access to full confidence vectors?
To evaluate an attack’s success, we pick a balanced set of points from a task distribution, of which half come from the target model’s training set. The adversary predicts whether each point was in the training set or not. We measure attack success as overall membership prediction accuracy but find F1 scores to approximately match, with near 100% recall.^{4}^{4}4Some recent works have questioned the use of (balanced) accuracy as a measure of attack success and proposed other measures more suited for imbalanced priors: where any data point targeted by the adversary is apriori unlikely to be a training point [22]. As our main goal is to study the effect of the model’s query interface on the ability to perform membership inference, we focus here on the same balanced setting considered in most prior work. We also note that the assumption that the adversary has a (near) balanced prior need not be unrealistic in practice: For example, the adversary might have query access to models from two different medical studies (trained on patients with two different conditions) and might know apriori that some targeted user participated in one of these studies, without knowing which.
Overall, we stress that the main goal of our evaluation, and of our paper, is to show that in settings where membership inference attacks have been shown to succeed, a label-only query interface is sufficient. In general, we should not expect our label-only attacks to exceed the performance of prior membership inference attacks, since the former use strictly less information from each query than the latter. As we will see in Sections VII and VIII, two notable exceptions to this are defenses that use "confidence masking" and models trained with significant data augmentations. In both cases, we find that existing attacks severely underestimate membership leakage.
We evaluate our label-only membership inference attacks on a variety of models trained on standard computer vision tasks: CIFAR-10, CIFAR-100 [25], and MNIST [26]. Our focus on vision datasets is mainly due to the important role of data augmentations in the common computer vision pipeline, and to compare directly with prior works that evaluated on similar datasets. We note that the principles behind our attacks carry over to other domains as well.
For each task, we train target neural networks on subsets of the original training data. Controlling the size of the target model's training set lets us control the amount of overfitting, which strongly influences the strength of membership inference attacks [58]. Prior works have shown that (confidence-based) membership inference attacks mainly succeed in settings where models exhibit a high degree of overfitting, so we evaluate our label-only attacks in similar settings. We use two representative model architectures: a standard convolutional neural network (CNN) and a ResNet [21]. Our CNN has four convolution layers with ReLU activations, arranged as two pairs of convolutions separated by a max-pool, where the second pair has 64 filters; logits are computed by a final fully-connected layer. Our ResNet-28 is a standard Wide ResNet-28 taken directly from [46]. All models are trained for 20 to 1000 training epochs, with early stopping when the training loss fails to decrease appreciably from one epoch to the next.

To tune the attack, the adversary trains a source (or shadow) model using an independent, non-overlapping subset of the task's original training dataset. For the attacks from prior work based on confidence vectors, and our new label-only attacks based on data augmentations, we use shallow neural networks as membership predictor models. Specifically, for augmentations, we use two layers of 10 neurons and LeakyReLU activations [30]. The confidence-vector attack models use a single hidden layer of 64 neurons, as originally proposed by Shokri et al. [44]. We train a separate prediction model for each class. We observe minimal changes in attack performance when changing the architecture, or when replacing the predictor model with a simple thresholding rule. For simplicity, our decision-boundary distance attacks use a single global thresholding rule.
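Since the paper notes that a simple thresholding rule performs comparably to the shallow attack networks, a per-class predictor can be sketched as a threshold on the shadow model's top confidence, tuned per class. This is an illustrative stand-in, not the authors' code; `confidences`, `labels`, and `is_member` are assumed to come from querying the shadow model on data with known membership.

```python
import numpy as np

def fit_per_class_thresholds(confidences, labels, is_member, n_classes):
    """For each class, pick the top-confidence threshold that best separates
    the shadow model's members from non-members (a simple stand-in for the
    per-class shallow attack networks)."""
    thresholds = np.zeros(n_classes)
    top = confidences.max(axis=1)  # model's top confidence per point
    for c in range(n_classes):
        mask = labels == c
        best_acc, best_t = 0.0, 0.5
        for t in np.unique(top[mask]):
            acc = np.mean((top[mask] >= t) == is_member[mask])
            if acc > best_acc:
                best_acc, best_t = acc, t
        thresholds[c] = best_t
    return thresholds

def predict_membership(confidences, labels, thresholds):
    # Apply the class-specific threshold of each queried point's label.
    return confidences.max(axis=1) >= thresholds[labels]
```

The thresholds are fit entirely on the adversary's shadow data and then transferred to queries against the target model.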
To account for randomness in our setup (e.g., subsampling of the data, model training, etc.), we repeat each individual experiment several times and report the mean and standard deviation where appropriate.
We first focus on question 1): understanding how well our label-only attacks compare with the canonical confidence-vector attacks of Shokri et al. [44]. Recall from Section IV-A that any label-only attack (with knowledge of a target's true label) is always trivially lower-bounded by the baseline gap attack of Yeom et al. [58], which simply predicts that a point is a non-member of the training set if it is misclassified.
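The gap attack baseline is simple enough to state in two lines. The sketch below also includes its expected accuracy on a balanced member/non-member split, which follows directly from the definition; the function names are illustrative.

```python
import numpy as np

def gap_attack(predicted_labels, true_labels):
    """Gap attack of Yeom et al.: predict that a point is a training-set
    member iff the model classifies it correctly."""
    return np.asarray(predicted_labels) == np.asarray(true_labels)

def gap_attack_accuracy(train_acc, test_acc):
    # On a balanced split, members are flagged with probability train_acc
    # and non-members rejected with probability (1 - test_acc), giving an
    # expected attack accuracy of 1/2 + (train_acc - test_acc) / 2.
    return 0.5 + (train_acc - test_acc) / 2.0
```

A perfectly generalizing model (zero train-test gap) thus reduces the gap attack to random guessing, which is why the gap serves as the natural baseline throughout the evaluation.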
Our main result is that our label-only attacks consistently outperform the baseline gap attack, and perform on par with prior confidence-vector attacks.
Figure 1 plots the accuracy of membership inference attacks on CIFAR-10, for models trained on subsets of varying size. The confidence-vector attack consistently outperforms the baseline gap attack, demonstrating that it exploits non-trivial membership leakage from the model's query responses. Remarkably, we find that our label-only boundary distance attack, based on the HopSkipJump attack [8] for finding adversarial examples, performs on par with, or slightly better than, the confidence-vector attacks, despite having access to a more restricted query interface. Moreover, the simpler (and more query-efficient, see Section VI-B below) label-only data augmentation attacks also consistently outperform the baseline, but fall short of the full confidence-vector attacks. The models in this evaluation did not use data augmentations during training; in Section VIII we find that when they do, our data augmentation attacks outperform all others.
Finally, we verify that as the training set size increases, the performance of the baseline attack, as well as that of all the other attacks, monotonically decreases, since the model's generalization gap is reduced.
Table II (a) reports similar results for the CIFAR-100 dataset, and (c) for the MNIST dataset. Due to the larger size of CIFAR-100, we provide results for a single model trained on half of the data points, which is the largest training set we can experiment with since we keep the other half for training the adversary's local source model. Mirroring the results on CIFAR-10, we find that the confidence-vector attack outperforms the gap attack, but that its performance can be matched by our best label-only attacks.
We now answer question 2) of our evaluation: in what regimes do different label-only attacks perform best?
Figure 1 shows that the decision-boundary distance attack performs significantly better than our label-only attacks based on data augmentations. Yet, the decision-boundary attack also requires a large number of queries to the target model, while the data augmentation attacks make only a handful of queries each. We now investigate how the success rate of different label-only attacks is influenced by the attack's query budget.
Recall that our rotation and translation attacks are parametrized by the maximum rotation angle and the maximum pixel shift, respectively, which control the number of augmented images (queries) that our attacks evaluate. Figure 2 (a)-(b) shows how the attack success rate is influenced by these parameters. For both the rotation and translation attack, we find that there is a specific range of perturbation magnitudes for which the attack exceeds the baseline. When the augmentations are too small or too large, the attack performs poorly because the augmentations have a similar effect on both train and test samples (i.e., small augmentations rarely change model predictions, whereas large augmentations often cause misclassifications, for train and test samples alike). For both attacks, an optimal parameter choice outperforms the baseline by several percentage points. Note that an adversary can tune the best rotation and translation magnitudes using its local source model. As we will see in Section VIII, these attacks perform significantly better for models that used data augmentation at training time.
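The translation variant of the augmentation attack can be illustrated as follows. This is a hedged sketch, not the authors' implementation: the zero-padded `translate` helper, the shift budget `d`, and the `predict_label` oracle are illustrative assumptions.

```python
import numpy as np

def translate(image, dx, dy):
    """Shift an HxW(xC) image by (dx, dy) pixels, zero-padding the border."""
    out = np.zeros_like(image)
    h, w = image.shape[:2]
    src_y = slice(max(-dy, 0), h - max(dy, 0))
    src_x = slice(max(-dx, 0), w - max(dx, 0))
    dst_y = slice(max(dy, 0), h - max(-dy, 0))
    dst_x = slice(max(dx, 0), w - max(-dx, 0))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def augmentation_attack_features(predict_label, image, y_true, d=1):
    """Query the model on all translations of up to d pixels and record which
    are still classified correctly; training points tend to survive more shifts."""
    shifts = [(dx, dy) for dx in range(-d, d + 1) for dy in range(-d, d + 1)]
    return np.array([predict_label(translate(image, dx, dy)) == y_true
                     for dx, dy in shifts])

def augmentation_attack(predict_label, image, y_true, threshold, d=1):
    # Predict "member" when a sufficient fraction of shifted queries survive.
    return augmentation_attack_features(predict_label, image, y_true, d).mean() >= threshold
```

In the paper the per-shift correctness pattern is fed to a small predictor network; thresholding the fraction of correct queries, as above, is the simpler variant the authors report as performing comparably.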
In Figure 2 (c), we compare different attacks that approximate the model's robustness to small perturbations (in the L2 norm) to obtain a proxy for prediction confidence. As an idealized baseline, we use the adversarial-examples attack of Carlini and Wagner [6], which assumes white-box access to the target model. Though not label-only, its success rate serves as a reasonable upper bound for the amount of membership leakage that can be extracted from boundary distances. We compare this upper bound to a label-only attack using HopSkipJump [8]. This attack has two parameters governing its query complexity: the number of iterations and the number of search steps per iteration. By varying these parameters, we explore the trade-off between query complexity and attack accuracy. As we can see, given a sufficiently large query budget the attack matches the upper bound given by Carlini-Wagner. In this setting, it also matches the best confidence-vector attack (see Figure 1). Even in low-query regimes, the attack outperforms the trivial gap attack.
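The core idea of a label-only boundary-distance estimate can be conveyed with a much cruder procedure than HopSkipJump: binary-search the distance to the boundary along a single direction, using only hard-label queries. This sketch is explicitly not HopSkipJump (which searches and refines many directions); all names and parameters are illustrative.

```python
import numpy as np

def boundary_distance(predict_label, x, y_true, direction=None,
                      radius=10.0, steps=30, rng=None):
    """Binary-search the distance from x to the decision boundary along one
    direction, using only hard labels (a crude stand-in for HopSkipJump)."""
    rng = np.random.default_rng(rng)
    if predict_label(x) != y_true:
        return 0.0  # misclassified points have distance zero
    if direction is None:
        direction = rng.normal(size=x.shape)
    direction = direction / np.linalg.norm(direction)
    if predict_label(x + radius * direction) == y_true:
        return radius  # no label flip found within the search radius
    lo, hi = 0.0, radius
    for _ in range(steps):
        mid = (lo + hi) / 2
        if predict_label(x + mid * direction) == y_true:
            lo = mid
        else:
            hi = mid
    return hi
```

Membership is then predicted by thresholding the estimated distance: training points tend to lie farther from the boundary than non-members.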
The final attack we evaluate is a label-only attack that measures the model's accuracy under random perturbations. Here, our queries to the target model are of the form x + δ with δ ~ N(0, σ²I). The noise magnitude σ is tuned to maximize the attack success rate on the adversary's local source model. Surprisingly, this simple attack performs very well in low-query regimes: for small query budgets, it outperforms the HopSkipJump-based attack, and typically also outperforms the data augmentation attacks at a given query budget. For large query budgets, the HopSkipJump attack produces more precise distance estimates and outperforms the random-noise attack.
In this section, we answer question 3) and showcase an example where our label-only attacks outperform prior attacks by a significant margin, despite the strictly more restricted query interface that they assume. We evaluate a number of defenses against membership inference attacks and show that while these defenses do protect against existing confidence-vector attacks, they have little to no effect on our label-only attacks.
We identify a common pattern in these defenses that we call confidence masking. Confidence-masking defenses aim to prevent membership inference by directly minimizing the information leakage in a model's confidence scores. Towards this goal, a defense that relies on confidence masking explicitly or implicitly masks (or obfuscates) the information contained in the confidence scores returned by the model, so as to thwart existing attacks. We focus our analysis on two defenses in this category: MemGuard [23] and adversarial regularization [33]. However, previously proposed defenses such as reducing the precision or number of returned values of the confidence vector [44, 54, 41], and recent defenses such as prediction purification [57], also rely on this mechanism.
Confidence masking thwarts existing attacks (e.g., by adding noise to the confidence vector) whilst having a minimal effect on the model's predicted labels. MemGuard [23] and prediction purification [57] explicitly maintain the invariant that the model's predicted labels are not affected by the defense, i.e., argmax_y f'(x)_y = argmax_y f(x)_y for all inputs x, where f' is the defended version of the model f. In adversarial regularization [33], instead of explicitly enforcing this constraint at test time, the model is trained to achieve high accuracy whilst simultaneously minimizing the information available in the confidence scores.
There is an immediate issue with the design of these confidence-masking defenses: by construction, they prevent neither the gap attack nor our stronger label-only attacks. Yet, these defenses were reported to drive the success rates of existing membership inference attacks close to chance. This result suggests that prior attacks fail to properly extract the membership leakage contained in the model's predicted labels, and hence implicitly contained within its scores. At the same time, our results with label-only attacks clearly indicate that confidence masking is not a viable defense strategy against membership inference.
In the following sections, we show that both MemGuard [23] (CCS'19) and adversarial regularization [33] (CCS'18) fail to prevent the naive gap attack as well as our more elaborate label-only attacks. In both cases, we show that the defense does not significantly reduce membership leakage compared to an undefended model.
We implement the MemGuard algorithm for defending against membership inference. This defense solves a constrained optimization problem to compute a defended confidence vector f'(x) = f(x) + n, where n is an adversarial noise vector that satisfies the following constraints: (1) the model still outputs a vector of "probabilities", i.e., f'(x)_y ≥ 0 and Σ_y f'(x)_y = 1; (2) the model's predictions are unchanged, i.e., argmax_y f'(x)_y = argmax_y f(x)_y; and (3) the noisy confidence vector "fools" existing membership inference attacks. To enforce the third constraint, the defender locally creates a membership attack predictor, and then optimizes the noise n to cause this predictor to mispredict membership. We consider the strongest version of the defense in [23], which is allowed to make arbitrary changes to the confidence vector under the constraint that the model's predicted label is unchanged.
Note that the second constraint guarantees that the defended model's train-test gap remains unaltered, and the defense thus has no effect on the baseline gap attack. Worse, by construction, this defense cannot prevent any label-only attack, because it preserves the output label of the model on all inputs.
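A toy version of such a label-preserving masking defense makes the argument concrete: the confidence vector can be made almost uninformative while the argmax, and hence every label-only signal, is untouched. This is a deliberately simplistic illustration, not MemGuard's actual optimization.

```python
import numpy as np

def memguard_like(confidences):
    """Toy confidence-masking defense: replace the confidence vector with an
    (almost) uniform one while keeping the argmax unchanged, in the spirit
    of MemGuard's label-preserving constraint."""
    k = len(confidences)
    flat = np.full(k, 1.0 / k)
    flat[np.argmax(confidences)] += 1e-3  # tiny margin preserves the argmax
    return flat / flat.sum()              # renormalize to a valid distribution
```

Any attack that thresholds confidence values is defeated by this transformation, yet the gap attack and all label-only attacks see exactly the same model behavior as before.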
The main reason this defense was found to protect against confidence-vector attacks in [23] is that those attacks were not properly adapted to the defense. Specifically, MemGuard is evaluated against confidence-vector attacks that are tuned on source models without MemGuard enabled. As a result, these attacks' membership predictors are tuned to distinguish members from non-members based on high confidence scores, which MemGuard obfuscates. In a sense, a label-only attack like ours is the "right" adaptive attack against MemGuard: since the model's confidence scores are no longer reliable, the adversary's best strategy is to extract membership information from hard labels, which the defense explicitly does not modify. Moving forward, we recommend the trivial gap baseline as an indicator of this form of confidence masking: if a non-adaptive confidence-vector attack performs significantly worse than the trivial gap baseline, the defense is likely masking confidences rather than actually reducing membership leakage.
From Figure 3, we observe that MemGuard, as expected, offers no protection against our label-only attacks. All our attacks significantly outperform the canonical (non-adaptive) confidence-vector attack, as well as the baseline attack, across all subset sizes that we evaluate. Thus, for a defense to protect against all forms of membership inference attacks, it cannot solely post-process the confidence vector; doing so will still leave the model vulnerable to label-only attacks. In Table II (b) and (d), we report similar results on CIFAR-100 and MNIST, respectively: while the defense breaks prior confidence-based attacks, it has no effect on the generalization gap, or on our stronger label-only attacks.
Prediction purification [57] is a similar defense. It trains a purifier model that is applied to the output vector of the target model: on a query x, the adversary receives the purified confidence vector rather than f(x). The purifier model is trained to minimize the information content in the confidence vector whilst preserving model accuracy. While the defense does not guarantee that the model's labels are preserved at all points, it is by design incapable of preventing the baseline gap attack, and our stronger label-only attacks would likely be similarly unaffected (intuitively, the purified model is just another deterministic classifier, so the membership leakage from a point's distance to the decision boundary should not be expected to change).
In a similar vein, many simple defenses proposed in prior work can be broken by label-only attacks. These include any static defense that reduces the information in confidence scores, such as returning only the top-k confidence scores, rounding the confidences to a lower precision, or various ways of noising the confidences [44].
Adversarial regularization [33] differs from MemGuard and prediction purification in that it does not simply obfuscate confidence vectors at test time. Rather, it jointly trains a target model and an attacker model in a min-max fashion. In alternating training steps, the attacker model is trained to maximize membership inference from the target model's outputs, and the target model is trained to produce outputs that are accurate yet fool the attacker.
We train a target model defended using adversarial regularization, using a confidence-vector membership classifier as the defensive classifier. This defense's training has two additional hyperparameters: one controls the ratio of maximization to minimization steps during training, and a regularization constant λ balances the target model's two objectives, i.e., low training error and low membership leakage. We test several values of the step ratio and find a setting that enables the target model to converge to a defended state. For the regularization term λ, we try different values and report all results in Figure 4. As we can see, an optimal choice of λ can reduce the confidence-vector attack's success to within a few percentage points of random guessing. However, this attack is outperformed by our label-only attacks. The defense has only a moderate effect on the model's train-test gap, and thus the accuracy of the trivial baseline attack is not reduced. We find that our more complex label-only attacks do not significantly outperform this baseline for most choices of λ, which is consistent with the effects we observe for more common regularization techniques such as L2 weight decay (see Section VIII). Thus, this defense is not entirely ineffective: it does prevent attacks from exploiting much more leakage than the trivial gap attack. And yet, evaluating the defense solely on (non-adaptive) confidence-vector attacks leads to an overestimate of the achieved privacy.
Following our findings that confidence-masking defenses cannot robustly defend against membership inference attacks, we now answer question 4). We investigate to what extent we can defend against membership inference attacks via standard regularization techniques, the aim of which is to limit the model's ability to overfit to the training set. This form of regularization was introduced by the ML community to encourage generalization. In this section, we study the impact of the following common regularization techniques on membership inference: data augmentation, transfer learning, dropout, L1/L2 regularization, and differential privacy.
The case of data augmentation is of particular interest: on the one hand, the regularization effect of data augmentation is expected to reduce membership leakage. On the other hand, some of our attacks directly exploit the model's overfitting on augmented data.
We explore three questions in this section:
How does training with data augmentation impact membership inference attacks, especially the ones that query the model on augmented data?
How well do other traditional regularization techniques from the machine learning literature help in reducing membership leakage?
How do these defenses compare to differential privacy, which can provide formal guarantees against any form of membership leakage?
Data augmentation is commonly used in machine learning to prevent a model from overfitting, in particular in low data regimes [45, 31, 52, 15, 16]. Data augmentation is used to increase the diversity of a model’s finite training set, by efficiently synthesizing new data via natural transformations of existing data points that preserve class semantics (e.g., small rotations or translations of images).
Data augmentation presents an interesting case study for our label-only membership inference attacks. As it reduces the model's overfitting, one would expect data augmentation to reduce membership leakage. At the same time, a model trained on augmented data will have been trained to strongly recognize not only the original data point, but also a number of augmented versions of it, which is precisely the signal that our label-only attacks based on data augmentations exploit.
We train target models by incorporating data augmentations similar to those described in Section IV-C. We focus here on image translations, as these are most routinely used to train computer vision models. In each training epoch, the model is evaluated on all translations of an image up to a maximum pixel shift. This simple pipeline differs slightly from the standard data augmentation pipeline, which samples an augmentation at random for each training batch. We opted for this approach to illustrate the maximum leakage incurred when the adversary's attack queries exactly match the samples seen during training. Later in this section, we will evaluate a robust pipeline taken directly from FixMatch [46] and show that our results from this simple pipeline transfer.
We plot the success of various membership inference attacks on models trained with data augmentation in Figure 5. First, we observe the effect of augmentations on overfitting: as the model is trained on larger image translations, the model's train-test gap decreases and its test accuracy grows, corroborating the benefits of data augmentation for improving generalization.
Yet, we find that as the model is trained on more data augmentations: (1) the accuracies of the confidence-vector and boundary distance attacks decrease; and (2) the success rate of the data augmentation attack increases.
The decrease in accuracy of the confidence-vector and decision boundary attacks is to be expected given the model's improved generalization. The increase in performance of the data augmentation attacks confirms our initial intuition that the model now leaks additional membership information via its invariance to the training-time augmentations. Note that the label-only attack on a model trained with pixel-shift augmentations exceeds the accuracy of the confidence-vector attack on the original non-augmented model, despite the augmented model having a higher test accuracy. This result illustrates that a model's ability to generalize is not the only variable affecting its membership leakage: models that overfit less on the original training dataset may actually be more vulnerable to membership inference, because they implicitly overfit more on a related training set.
In Figure 11 in the Appendix, we investigate how the attack is impacted when the attacker's choice of translation magnitude does not match the value used during training. Unsurprisingly, we find that the attack is strongest when the attacker's guess is correct, and that it degrades by a few percentage points as the difference in magnitude between train and test augmentations grows. We note that data augmentation values for a specific domain and image resolution are often fixed, so adversarial knowledge of the model's data augmentation pipeline is not a strong assumption.
Following our study of a simple data augmentation scheme, we now aim to understand to what extent membership inference attacks apply to a state-of-the-art neural network and data processing pipeline. We use, without modification, the pipeline from FixMatch [46], which trains a ResNet-28 on the CIFAR-10 dataset to accuracy comparable to the state of the art. As with our other experiments, this model is trained using a subset of CIFAR-10, which sometimes leads to observably overfit models, as indicated by a higher gap attack accuracy. We train models using four regularizations, either all enabled ("With Augmentations") or all disabled ("Without Augmentations"):
random image flips across the horizontal axis,
random image shifts of up to several pixels in each direction,
random image cutout [13], and
weight decay.
Our augmentation attacks are tuned to mimic the training pipeline, since this is the case where our attacks perform best. We evaluate a set of randomly generated augmentations for this attack. These results are reported in Figure 6. As with our simple pipeline, we find that the use of augmentations in training consistently improves generalization accuracy. Interestingly, the gap attack accuracy also improves, due to a relatively larger increase in training accuracy. As with our simple pipeline, the confidence-vector attack's accuracy is degraded when training with augmentations, but our augmentation attack can now perform on par with (and, in some cases, better than) the confidence-vector attack.
An interesting question stemming from our experiments, which we leave for future work, is to understand how much membership leakage can be exploited by querying the target model on augmented data in a setting where the attacker does receive full confidence vectors. As such an adversary receives strictly more information than the label-only attacks we consider here, we expect such an attack to do at least as well as the best attack in Figure 5, and potentially even better (although all our experiments in this paper suggest that full confidence vectors provide little additional membership leakage compared to hard labels).
As we have seen, data augmentation does not necessarily prevent membership leakage, despite its positive regularization effect. We now explore our two remaining questions and turn to other standard machine learning techniques aimed at preventing overfitting: dropout [48], weight decay (L1/L2 regularization), transfer learning, and differentially private training [1].
Dropout and weight decay are straightforward to add to any neural network. We provide more detail in Appendix A.
Transfer learning improves the generalization of models trained on small datasets. A model is first trained on a larger dataset from a related task, and this model is then fine-tuned to the specific low-data task. To fine-tune the pretrained model, we remove its last layer (so that the model acts as a feature extractor) and train a new linear classifier on top of these features. We call this approach last-layer fine-tuning. An alternative is to fine-tune the feature extractor together with the final linear layer, i.e., full fine-tuning.
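Last-layer fine-tuning amounts to fitting a linear softmax head on frozen features. The sketch below shows this in plain numpy under the assumption that `features` are the pretrained extractor's (frozen) penultimate-layer activations; the function name and hyperparameters are illustrative, not the paper's setup.

```python
import numpy as np

def last_layer_finetune(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit only a new linear softmax head on frozen features; the pretrained
    feature extractor itself is never updated (last-layer fine-tuning)."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                      # cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

Because only the linear head is trained, the model has far less capacity to memorize individual points, which is consistent with the reduced membership leakage the paper observes for this variant.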
We pretrain a model on CIFAR-100 and then use either full fine-tuning or last-layer fine-tuning on a subset of CIFAR-10. The results of various membership inference attacks are in Figure 7. We compare the gap attack to the best label-only attack, noting that the best label-only attack performed on par with the confidence-vector attack in all cases. We observe that transfer learning indeed reduces the generalization gap, especially when only the last layer is tuned (this is intuitive, as linear layers have less capacity to overfit compared to full neural networks). We see that with full fine-tuning, the model still leaks additional membership information, and is thus not an effective defense. Tuning just the last layer, however, reduces all attacks to the baseline gap attack, which performs only marginally better than chance. On the other hand, full fine-tuning can achieve better test accuracies, as shown in Figure 9.
Finally, differentially private (DP) training [1] enforces, in a formal sense, that the trained model does not strongly depend on any individual training point; in other words, it does not overfit. In this paper, we use DP-SGD [1], a differentially private gradient descent algorithm (see Appendix A for details). We find that to train differentially private models to test accuracy comparable to undefended models, the privacy budget ε must be set so high that the formal guarantees become mostly meaningless.
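The core aggregation step of DP-SGD (per-example gradient clipping followed by Gaussian noise) can be sketched as follows. This is an illustrative fragment, not the paper's training code; practical implementations (e.g., in dedicated DP libraries) also track the cumulative privacy budget, which is omitted here.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to
    `clip_norm` in L2 norm, sum, add Gaussian noise scaled to the clipping
    bound (sigma = noise_mult * clip_norm), and average."""
    rng = np.random.default_rng(rng)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    total += rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return total / len(per_example_grads)
```

Clipping bounds any single example's influence on the update, and the noise masks what influence remains, which is exactly the mechanism that limits membership leakage.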
We evaluate membership inference attacks against models trained with a wide variety of defensive techniques in Figure 8. We find that most forms of regularization do not substantially reduce the train-test gap, and fail to prevent even the baseline gap attack from succeeding well above chance. The only two forms of regularization that consistently succeed in reducing membership leakage are strong L2 regularization (i.e., a large weight-decay constant) and training with differential privacy. To better understand the trade-off between privacy and utility, Figure 9 displays the relationship between each model's test accuracy and its vulnerability to membership inference. As we can see, the models trained with differential privacy and strong L2 regularization prevent membership inference at a high cost in generalization ability; these high levels of regularization are actually causing the model to underfit. The plot also clearly indicates the privacy benefits of transfer learning: among models with a similar level of privacy leakage, these models achieve consistently better generalization, as they benefit from the features learned from non-private data. Combining transfer learning and differentially private training can further mitigate privacy leakage at nearly no cost in generalization, yielding models with the best trade-off. When transfer learning is not an option, dropout appears to perform better.
Figure 8 again illustrates the shortcomings of confidence-masking defenses such as MemGuard and adversarial regularization: instead of reducing a model's train-test gap, they obfuscate model outputs so that existing attacks perform worse than the trivial baseline attack. Our label-only attacks bypass these obfuscation attempts and break the defenses.
In line with prior work, our experiments show that the best average-case membership inference attacks often extract more leakage than the trivial train-test gap baseline, but only by a moderate amount. Moreover, whenever prior attacks do succeed in extracting additional membership leakage, we find that the same can be achieved by an adversary with label-only query access to the model.
We thus now turn to the study of membership inference in the worst case, i.e., inferring membership only for a small set of "outlier" users. Intuitively, even if a model generalizes well on average over the data distribution, it might still have overfit to unusual data points in the tails of the distribution [5]. The study of membership inference for outliers was initiated by Long et al. [29]. We follow a similar process to theirs to identify potential outlier data, as described below.
First, the adversary uses a local source model to map each targeted data point to the model's feature space: for each input x, we extract the activations z(x) in the penultimate layer of the source model. We define two points x, x' as neighbors if their features are close, i.e., d_cos(z(x), z(x')) ≤ δ, where d_cos is the standard cosine distance and δ is a tunable parameter. We define an outlier as a point with fewer than k neighbors in the source model's feature space, where k is another tunable parameter. Given a dataset of potential targets and an intended fraction of outliers (e.g., a small percentage of the data), we tune δ and k so that the desired fraction of points are defined as outliers.
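The outlier-selection step can be sketched directly from the definitions above. This is a hedged illustration: `Z` is assumed to hold the source model's penultimate-layer features (one row per point), and the symbols δ and k match the tunable parameters just described.

```python
import numpy as np

def cosine_distance_matrix(Z):
    """Pairwise cosine distances (1 - cosine similarity) between feature rows."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return 1.0 - Zn @ Zn.T

def find_outliers(Z, delta, k):
    """Flag points with fewer than k neighbors, where a neighbor is any
    other point whose features are within cosine distance delta."""
    D = cosine_distance_matrix(Z)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    n_neighbors = (D < delta).sum(axis=1)
    return n_neighbors < k
```

In practice the adversary would sweep δ and k until the flagged fraction matches the intended share of outliers, and then run the membership attack only on the flagged points.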
The adversary then only runs the membership inference attack for the selected outliers. We define the adversary’s success as its precision in inferring membership of outliers in the targeted model’s training set.
We run our membership inference attacks on outliers for the same models that we evaluated in Figure 8. The results are in Figure 10. For each type of regularization scheme, we display the improvement in the attacker's precision when targeting solely outliers, compared to targeting the entire population. We find that we can always improve the attack by focusing only on outliers, but that strong regularization (e.g., as obtained by L2 weight decay with a large regularization constant, or with differential privacy) prevents membership inference even for outliers. As in the average case, we find that the best label-only attacks perform on par with prior confidence-vector attacks, so we simply report the best overall attack in Figure 10.
We have developed three new label-only membership inference attacks. Their label-only nature requires fundamentally different attack strategies that, in turn, cannot be trivially prevented by obfuscating a model's confidence scores. We have used these attacks to break two state-of-the-art defenses against membership inference attacks.
We have found that the problem with these defenses runs deeper: they cannot meaningfully prevent even the trivial attack that predicts a point is a training-set member if it is classified correctly. As a result, any defense against membership inference necessarily has to reduce a model's train-test gap.
We have further confirmed that attacks from prior work can, in some settings, extract more membership leakage than this baseline attack, but that the same can be achieved by label-only attacks that operate in a more restrictive adversarial model.
Finally, via a rigorous evaluation across many proposed defenses to membership inference, we have shown that differential privacy provides the strongest defense against membership inference, both in an average-case and a worst-case sense, but that this may come at a cost in the model's prediction accuracy.
Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv:2001.07685.
Stealing hyperparameters in machine learning. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 36–52.
Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17), pp. 1307–1322.