Label-Only Membership Inference Attacks

by Christopher A. Choquette-Choo, et al.

Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.




I Introduction

Machine learning algorithms are often trained on sensitive or private user information, such as medical records [24, 49], textual conversations [12, 11], or financial information [35, 42]. These trained models can inadvertently leak information about their training data [44, 34, 5]—thereby violating users’ privacy.

In perhaps the simplest form of information leakage, membership inference [44] attacks allow an adversary to determine whether or not a data point was used in the training data. Revealing just this information can cause harm, as it leaks information pertaining specifically to members of the model’s training data, rather than about the user population as a whole. For example, suppose a model is trained to learn the link between a cancer patient’s morphological data and their reaction to some drug. An adversary in possession of a victim’s morphological data and with query access to the trained model cannot directly infer whether the victim has cancer. However, inferring that the victim’s data was part of the model’s training set reveals that the victim indeed has cancer.

Existing membership inference attacks exploit the higher prediction confidence that models exhibit on their training data [38, 58, 54, 20, 41, 44]. This difference in prediction confidence is largely attributed to overfitting [44, 58]. In these attacks, the adversary queries the model on a target data point to obtain the model’s confidence and infers the target’s membership in the training set based on a decision rule.

A large body of work has been devoted to understanding and mitigating membership inference leakage in ML models. Existing defense strategies fall into two broad categories:

  1. reducing overfitting [54, 44, 41]; and,

  2. perturbing a model’s predictions so as to minimize the success of known membership attacks [33, 23, 57].

Defenses in the first category either use regularization techniques (e.g., dropout, weight decay, or early stopping) developed by the ML community to reduce overfitting, or simply increase the amount of training data [54, 41, 44]. In contrast, the second category of adversary-aware defenses explicitly aims to minimize the membership inference leakage as computed by a particular attack [23, 33, 57]. Existing defenses in this category alter a model’s outputs, either through a modification of the training procedure (e.g., the addition of a loss penalty) or of the inference procedure post hoc to training (e.g., to flatten returned confidence scores on members).

In this paper, we introduce label-only membership inference attacks. Our threat model makes fewer assumptions compared to prior attacks, in that the adversary can only obtain (hard) labels when querying the trained model, without any prediction confidences. This threat model is more realistic in practice—as many machine learning models deployed in user-facing products are unlikely to expose raw confidence scores.

In the label-only setting, a naive baseline strategy predicts that a target point is a member of the training set when the model’s prediction is correct. We show that even this simple attack matches the best confidence-based attacks in some settings. In order to design label-only attacks that perform better than this baseline, we will necessarily have to make multiple queries to the target model. We show how to extract fine-grained membership leakage by analyzing a model’s robustness to perturbations of the target data, which reveals signatures of the model’s decision boundary geometry. Our adversary queries the model for predictions on augmentations of data points (e.g., rotations and translations in the vision domain) as well as adversarial examples.

In an extensive evaluation we show that our attacks match the performance of confidence-based attacks (see Section VI-A). We further show that our attacks naturally break existing defenses that fall into category (2) discussed above. These defenses either implicitly or explicitly rely on a strategy that we call confidence masking (similar to gradient masking from the adversarial examples literature [36]). This strategy consists of masking the membership inference leakage signal contained in the model’s confidence scores, thereby thwarting existing attacks. However, the (hard) labels predicted by the model remain largely unaffected, which explains why such defenses have little to no effect against our label-only attacks. Put differently, confidence masking does not address the inherent privacy leakage that stems from the model being overfit on the training data. This allows us to break two state-of-the-art defenses: MemGuard [23] and adversarial regularization [33]. While these defenses can successfully reduce the accuracy of existing (confidence-based) membership inference attacks to within percentage points of random chance, they have a negligible effect on the success rate of our attacks.

Overall, our evaluation demonstrates that the use of confidence values in membership inference attacks is unnecessary. Existing attacks either do not outperform the naive baseline, or when they do, their performance can be matched by attacks that only rely on the model’s predicted label.

Finally, we argue that successful membership inference defenses should not only protect the privacy of the average user, but also that of the worst-case outlier user. We find that for some models with low average-case membership leakage, the membership of users in the tails of the distribution can still be inferred with high precision, even with label-only attacks. Models trained with differential-privacy guarantees [1, 7, 56, 3] appear to effectively minimize the amount of membership leakage for all users, even when the formal privacy bounds are close to meaningless (i.e., differential privacy with a very large privacy budget $\epsilon$).

We make the following contributions:

  1. We introduce the first label-only attacks, leveraging data augmentations and adversarial examples.

  2. We show that confidence masking is not a viable defense to privacy leakage, by breaking two canonical defenses that use it—MemGuard and Adversarial Regularization—with our attacks.

  3. We evaluate two additional techniques for reducing overfitting and find that training with data augmentations can worsen membership inference leakage while transfer learning can mitigate this leakage.

  4. We introduce “outlier membership inference”: a stronger property that defenses should satisfy; at present, differentially private training and (strong) L2 regularization appear to be the only effective defenses.

  5. We will release code to reproduce all our experiments.

II Background

II-A Machine Learning

We consider supervised classification tasks [32, 43], wherein a model is trained to predict some class label $y$, given input data $x$. Commonly, $x$ may be an image or a sentence and $y$ would be the corresponding label, for instance, a digit 0-9, an object type, or a text sentiment.

We focus our study on neural networks: functions composed as a series of linear-transformation layers, each followed by a non-linear activation. The overall layer structure is called the model’s architecture and the learnable parameters of the linear transformations are the weights. For a classification problem with $K$ classes, the last layer of a neural network outputs a vector $v$ of $K$ values (often called logits). The softmax function is typically used to convert the logits into normalized confidence scores: $\mathrm{softmax}(v)_i = e^{v_i} / \sum_{j=1}^{K} e^{v_j}$. (While it is common to refer to the output of a softmax as a “probability vector” because its components are in the range $[0, 1]$, we refrain from using this terminology given that the scores output by a softmax cannot be rigorously interpreted as probabilities [18].) For a model $h$, we define the model’s output $h(x)$ as the vector of softmax values. The model’s predicted label is the class with highest confidence, i.e., $\arg\max_i h(x)_i$.
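As a minimal sketch of the two quantities defined above, the following Python snippet computes softmax confidence scores from a logit vector and then takes the predicted (hard) label. The function names and the example logits are illustrative assumptions, not part of the paper.

```python
import math

def softmax(logits):
    """Convert a list of logits into normalized confidence scores."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_label(confidences):
    """Index of the class with the highest confidence (the hard label)."""
    return max(range(len(confidences)), key=lambda i: confidences[i])

logits = [2.0, 0.5, -1.0]   # hypothetical logits for K = 3 classes
scores = softmax(logits)    # what a confidence-vector interface would return
label = predicted_label(scores)  # what a label-only interface would return
```

A label-only adversary sees only `label`; the vector `scores` stays hidden behind the query interface.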

II-A1 Data Augmentations

Augmenting data aims to improve the generalization of a classifier [9, 46, 52]. Data augmentations are commonly used on state-of-the-art models [21, 9, 37] to create new and larger datasets to learn from, without the need to acquire more labeled data samples (a costly process). Augmentations are especially important in low-data regimes [40, 14, 10]. Augmentations are domain-specific: they apply to a certain type of input (e.g., images or text).

We focus on image classifiers, where the main types of augmentations are affine transformations (rotations, reflections, scaling, and shifts), contrast adjustments, cutout [13], and blurring (adding noise). By synthesizing a new data sample $x'$ as an augmentation of an existing data sample $x$, the model can learn a more semantically-meaningful set of features. Data augmentations can potentially teach the machine learning model to become invariant to the augmentation (e.g., rotationally or translationally invariant).

II-A2 Transfer Learning

Transfer learning is a common technique used to improve generalization in low-data regimes [51]. By leveraging data from a source task, it is possible to transfer knowledge to a target task. A common approach for transfer learning is to train a model on the data of the source task, and then fine-tune this model on data from the target task. In the case of neural networks, it is common to fine-tune either the entire model, or just the last layer.

II-B Membership Inference

Membership inference attacks [44] are a form of privacy leakage that identifies if a given data sample was part of a machine learning model’s training dataset.

Given an example $x$ and access to a trained model $h$, the adversary uses a classifier or decision rule $f$ to compute a membership prediction $f(x; h) \in \{0, 1\}$, with the goal that $f(x; h) = 1$ whenever $x$ is a training point. The main challenge in mounting a membership inference attack is creating the classifier $f$, under various assumptions about the adversary’s knowledge of $h$ and of its training data distribution.

Prior work assumes that an adversary has only black-box access to the trained model $h$, via a query interface that on input $x$ returns part or all of the confidence vector $h(x)$.

Shadow Models

The original membership inference attack of Shokri et al. [44] creates a membership classifier $f$ by first training a number of local “shadow” models (we will also refer to these as source models). Assuming that the adversary has access to data from the same (or a similar) distribution as $h$’s training data, the adversary first locally trains a number of auxiliary classifiers $\hat{h}_i$ on subsets of the data. Since these shadow models are trained by the adversary, their training sets, and by extension the membership of any data point in these training sets, are known. The adversary can thus construct a dataset of confidence vectors $\hat{h}_i(x)$ with an associated membership label $m \in \{0, 1\}$. Finally, the adversary trains a classifier $f$ to predict $m$ given $\hat{h}_i(x)$. To apply the attack, the adversary queries the targeted model $h$ to obtain the confidence vector $h(x)$, and then uses its trained classifier $f$ to predict the membership of $x$ in $h$’s training data.

Salem et al. [41] later showed this attack strategy could be successful even without access to data from the same distribution as $h$’s training data, and only to data from a similar task (e.g., a different vision task). They also demonstrated that training shadow models is unnecessary: applying a simple threshold on the targeted model’s confidence scores suffices. That is, the adversary predicts that $x$ is in $h$’s training data if the prediction confidence $\max_i h(x)_i$ is above a tuned threshold.
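The confidence-thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name, the example confidence vectors, and the threshold value of 0.9 are assumptions chosen for the example, not values from the paper.

```python
def confidence_threshold_attack(confidences, threshold):
    """Predict membership (True) when the top confidence exceeds the threshold."""
    return max(confidences) > threshold

# Hypothetical confidence vectors returned by a target model.
likely_member = [0.97, 0.02, 0.01]      # very confident prediction
likely_non_member = [0.55, 0.30, 0.15]  # less confident prediction

threshold = 0.9  # assumed value; the text only says the threshold is tuned
guess_a = confidence_threshold_attack(likely_member, threshold)
guess_b = confidence_threshold_attack(likely_non_member, threshold)
```

Note that this rule needs the confidence vector itself, which is exactly the information a label-only query interface withholds.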

Towards Label-only Approaches

Yeom et al. [58] propose a simple baseline attack: the adversary predicts a data point $x$ as being a member of the training set when $h$ classifies $x$ correctly. The accuracy of this baseline attack directly reflects the gap in the model’s train and test accuracy: if $h$ overfits on its training data and obtains much higher accuracy on its training data, this baseline attack will achieve non-trivial membership inference. We call this the gap attack. If the adversary’s target points are equally likely to be members or non-members of the training set (for more on this, see Section V-A), this attack achieves an accuracy of

$$\frac{1}{2} + \frac{\mathrm{acc}_{\mathrm{train}} - \mathrm{acc}_{\mathrm{test}}}{2}, \tag{1}$$

where $\mathrm{acc}_{\mathrm{train}}, \mathrm{acc}_{\mathrm{test}} \in [0, 1]$ are the target model’s accuracy on training data and held out data respectively.
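To make the gap attack's accuracy concrete, the sketch below checks the closed-form expression against a small, made-up balanced evaluation set. On a balanced set the attack is right on correctly classified members and on misclassified non-members, so its accuracy is (acc_train + (1 - acc_test)) / 2, which rearranges to 1/2 + (acc_train - acc_test) / 2. The accuracies 0.95 and 0.75 and all function names are illustrative assumptions.

```python
def gap_attack(predicted_label, true_label):
    """Predict 'member' (True) exactly when the model classifies the point correctly."""
    return predicted_label == true_label

def gap_attack_accuracy(acc_train, acc_test):
    """Closed-form accuracy of the gap attack on a balanced member/non-member set."""
    return 0.5 + (acc_train - acc_test) / 2.0

# Hypothetical balanced evaluation set of (predicted, true) label pairs:
# 100 members with 95% train accuracy, 100 non-members with 75% test accuracy.
members = [(1, 1)] * 95 + [(0, 1)] * 5
non_members = [(1, 1)] * 75 + [(0, 1)] * 25

correct = sum(gap_attack(p, t) for p, t in members)
correct += sum(not gap_attack(p, t) for p, t in non_members)
empirical = correct / (len(members) + len(non_members))
closed_form = gap_attack_accuracy(0.95, 0.75)
```

Both quantities come out to 0.6 (up to floating-point rounding): a 20-point train-test gap gives the attack a 10-point edge over random guessing.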

To the best of our knowledge, this is the only attack proposed in prior work that makes use of only the model’s predicted label, $\arg\max_i h(x)_i$. Our goal is to investigate how this simple baseline can be surpassed to achieve label-only membership inference attacks that perform on par with attacks that use access to the model’s confidence scores.

Indirect Membership Inference

The work of Long et al. [29] investigates the possibility of membership inference through indirect access, wherein the adversary can only query $h$ on inputs $x'$ that are related to $x$, but not on $x$ directly. The label-only attacks we present in this paper similarly make use of information gleaned from querying the model on data points related to $x$ (specifically, various perturbed versions of $x$).

The main difference is that we focus on label-only attacks, whereas the work of Long et al. [29] assumes adversarial access to the model’s confidence scores. Our attacks will also be allowed to query $h$ at the chosen point $x$, but again only to obtain the model’s predicted label.


Defenses Against Membership Inference

Defenses against membership inference broadly fall into two categories.

First, Shokri et al. [44] demonstrated that overfitting plays a role in their attack’s success rate. Thus, standard regularization techniques such as L2 weight normalization [44, 23, 54, 33], dropout [23], or differential privacy have been proposed to defend against membership inference. Heavy regularization has been shown to limit overfitting and to effectively defend against membership inference, but may result in a significant degradation in the model’s accuracy. Moreover, Yeom et al. [58] show that overfitting is sufficient, but not necessary, for membership inference to be possible.

Second, a variety of techniques have been suggested for reducing the information contained in a model’s confidence scores. These include truncating confidence scores to a lower precision [44], reducing the dimensionality of the confidence score vector [44, 54] (e.g., only returning the top k scores), or perturbing confidences via an adversary-aware “minimax” approach [33, 57, 23]. These latter defenses modify either the model’s training or inference procedure so that the model produces minimally perturbed confidence vectors that thwart existing membership inference attacks. We refer to defenses in this second category as “confidence-masking” defenses.

Outliers in Membership Inference

Most membership inference research is focused on protecting the average-case user’s privacy: the success of a membership inference attack is evaluated over a large dataset. Long et al. [29] focus on understanding the vulnerability of outliers to membership inference. They show that some outlier data points can be targeted and have their membership inferred with high precision (up to 90% precision on the targeted outliers) [28, 29]. Recent work analyzes membership inference from the defender’s perspective, that is, in a white-box setting with complete access to the model, to understand how overfitting impacts membership leakage [27].

III Threat Model

Query Interface   | Attack Feature | Knowledge          | Source
confidence vector |                | train, data, label | [44]
confidence vector |                | train, data        | [28]
confidence vector |                |                    | [41]
confidence vector |                |                    | [41]
confidence vector |                |                    | [58]
label-only        |                | label              | [58]
label-only        |                | train, data, label | ours
label-only        |                | train, data, label | ours

TABLE I: Survey of membership inference threat models. For attack features, $\mathcal{L}$ is the model’s loss function, $\mathrm{aug}(x)$ is a data-augmentation of $x$ (e.g., image translation), and $\mathrm{dist}_h(x, y)$ is the distance from $x$ to the decision boundary. Train, data and label knowledge mean, respectively, that the adversary (1) knows the model’s architecture and training algorithm, (2) has access to samples from the training distribution, and (3) knows the true label of the examples being queried.

The goal of a membership inference attack is to determine whether or not a candidate data point was used to train a given model. In Table I, we summarize different sets of assumptions made in prior work about the adversary’s knowledge and query access to the model.

III-A Adversarial Knowledge

The membership inference threat model originally introduced by Shokri et al. [44], and used in many subsequent works, assumes that the adversary has black-box access to the model $h$ (i.e., the adversary cannot inspect the model’s learned parameters and can only interact with it via a query interface that returns the model’s prediction and confidence). Our work also assumes black-box model access, with the extra restriction, which we discuss in more detail in Section III-B, that the model only returns (hard) labels to queries. We note that studying membership inference attacks with white-box model access [27] has merits (e.g., for upper-bounding the membership leakage), but our label-only restriction inherently presumes a setting where the adversary has black-box model access only (as otherwise, the adversary could just run $h$ locally to obtain confidence scores).

Assuming a black-box query interface, there are a number of other dimensions to the adversary’s assumed knowledge of the trained model:

Task Knowledge

refers to global information about the model’s prediction task and, therefore, of its prediction API. Examples of task knowledge include the total number of classes, the class-labels (dog, cat, etc.), and the input format (RGB or grayscale images, etc.). Task knowledge is always assumed to be known to the adversary, as it is necessary for the classifier service to be useful to a user.

Training Knowledge

refers to any knowledge about the model architecture (e.g., the type of neural network, its number of layers, etc.) and how it was trained (the training algorithm, size of the training dataset, or number of training steps, etc). Some of this information could be publicly available, or inferable from a model extraction attack [53, 55].

Data Knowledge

constitutes any knowledge about the data that was used to train the target model. Of course, full knowledge of the training data renders membership inference trivial. Partial knowledge may consist of having access to (or the ability to generate) samples from the same data distribution, or from a related distribution.

Label Knowledge

refers to knowledge of the true label $y$ for each point $x$ for which the adversary is predicting membership. Whether knowledge of a data point implies knowledge of its true label depends on the application scenario. Salem et al. [41] show that attacks that rely on knowledge of query labels can often be matched by attacks that do not.

III-B Query Interface

Our paper studies a different query interface than most prior membership inference work. The choice of query interface ultimately depends on the application needs where the target model is deployed. We define two types of query interfaces, with different levels of response granularity:

Full confidence vectors

On a query $x$, the adversary receives the full vector of confidence scores $h(x)$ from the classifier. In a multi-class scenario, each value in this vector corresponds to an estimated confidence that this class is the correct label. Prior work has shown that restricting access to only part of the confidence vector has little effect on the adversary’s success [44, 54, 41].

Label-only

In this setting, the adversary only obtains the model’s predicted label $\arg\max_i h(x)_i$, without any confidence scores. This is the minimal piece of information that any query-able machine learning model must provide, and is thus the most restrictive query interface, from the adversary’s perspective. Such a query interface is also highly realistic, as the adversary may only get indirect access to a deployed model in many settings. For example, the model may be part of a larger system, which takes actions based on the model’s predictions. Here, the adversary can only observe the system’s actions but not the internal model’s confidence scores.

In this work, we focus exclusively on the above label-only regime. Thus, in contrast to prior research [44, 20, 54, 41], our attacks can be mounted against any machine learning service, regardless of the granularity of the provided query interface.

III-C Our Threat Model

As our main goal is to show that label-only attacks can match the success of prior attacks, we consider a simple threat model that matches the one typically considered in prior work, except that we notably assume a label-only query interface.

We assume that the adversary has: (1) full knowledge of the task; (2) knowledge of the target model’s architecture and training setup; (3) partial data knowledge, i.e., access to a disjoint partition of data samples from the same distribution as the target model’s training data; and (4) knowledge of the targeted points’ labels, $y$.

We note that prior work has proposed various techniques to build strong membership inference attacks under relaxed adversarial-knowledge assumptions, specifically of reduced data and model architecture knowledge [58, 41]. To simplify our exposition and to center our analysis on comparing the confidence-vector and label-only settings, we leave a fine-grained analysis of label-only attacks under different levels of adversarial knowledge to future work.

IV Attack Model Design

We propose new membership inference attacks that improve on existing attacks in two ways:

  1. Our attacks extract fine-grained information about the classifier’s decision boundary by combining multiple queries on strategically perturbed samples.

  2. Our attacks are label-only, i.e., they do not rely on the model returning confidence scores.

Therefore, our attacks pose a threat to any machine learning service that can be queried, regardless of any output information it might provide in addition to the predicted label. Moreover, we show that our label-only attacks can break multiple state-of-the-art defenses that implicitly or explicitly rely on “confidence-masking” (see Section VII).

IV-A A Naive Baseline: The Gap Attack

Label-only attacks face a challenge of granularity in determining the membership of a data point. For any query $x$, our attack model’s information is limited to only the predicted class-label, $\arg\max_i h(x)_i$. A simple baseline attack [58], which predicts any misclassified data point as a non-member of the training set, is a useful benchmark to assess the extra information that different attacks (whether label-only or with access to confidence vectors) can extract. We call this baseline the gap attack because its accuracy is directly related to the gap between the model’s accuracy on training data and held out data (see Equation (1)). To glean additional bits of information on top of this baseline attack, any adversary operating in the label-only regime must necessarily make additional queries to the model.

IV-B Attack Intuition

At a high level, our strategy is to compute label-only “proxies” for the model’s confidence in a particular prediction, by strategically querying the model on various perturbed versions of $x$. Specifically, we evaluate the target model’s robustness to different input perturbations, either synthetic (i.e., standard data augmentations) or adversarial (i.e., adversarial examples [50]), and predict that data points that exhibit high robustness are training data points.

The intuition for leveraging robustness to data augmentations stems from the fact that many models use data augmentation at training time. Thus, if some data point $x$ was used to train the model then so were augmented versions of $x$. By querying the model on these augmented versions of a target point, we aim to obtain a more precise membership signal. In some sense, this can be seen as exploiting the model’s “effective” train-test gap on an augmented dataset.

Even for models that were not trained with data augmentations, studying a model’s robustness to perturbations can serve as a proxy for model confidence, as we now show for the special case of (binary) logistic regression models. Given a learned weight vector $w$ and bias $b$, a logistic regression model outputs a confidence score for the positive class of the form:

$$h(x) = \sigma(w^\top x + b),$$

where $\sigma(t) = 1 / (1 + e^{-t})$ is the logistic function.

For such a linear model, there is a monotone relationship between the model’s confidence at a point $x$, and the Euclidean distance from $x$ to the model’s decision boundary. Specifically, the (signed) distance from $x$ to the model’s boundary is $(w^\top x + b) / \|w\|_2 = \sigma^{-1}(h(x)) / \|w\|_2$. Thus, for linear models, obtaining a point’s distance to the boundary yields the same information as the model’s confidence score. As it turns out, computing the distance from a point to the boundary is exactly the problem of finding the smallest adversarial perturbation, which can be done using label-only access to a classifier [4, 8]. Our thesis is that for deep, non-linear models, the relationship between a model’s confidence scores and the distance to its boundary will persist. This thesis is supported by prior work that suggests that deep neural networks can be closely approximated by linear functions in the vicinity of the data [19, 39].

Song et al. [47] also make use of adversarial examples to infer membership. Their approach crucially differs from ours in two aspects: (1) they assume access to confidence scores, and (2) they target models that were explicitly trained to be robust to adversarial examples. In this sense, their approach bears some similarities with our attacks on models trained with data augmentation (see Section VIII), where we also find that a model’s invariance to some perturbations can leak additional membership signal.
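The equivalence between confidence and boundary distance for logistic regression can be checked numerically. The sketch below, with made-up weights and a made-up input, computes the signed distance once geometrically and once from the confidence score alone via the inverse logistic function; all names and values are illustrative assumptions.

```python
import math

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def confidence(w, b, x):
    """Positive-class confidence of a logistic regression model."""
    return sigmoid(dot(w, x) + b)

def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

w, b = [3.0, -4.0], 0.5   # hypothetical weights and bias, with ||w||_2 = 5
x = [1.0, 0.25]           # hypothetical input, so that w.x + b = 2.5

d_geometric = signed_distance(w, b, x)
d_from_confidence = logit(confidence(w, b, x)) / math.sqrt(dot(w, w))
```

Both routes give a distance of 0.5, so for this linear model an adversary who can measure the distance to the boundary learns exactly what the confidence score would have revealed (given the norm of the weights).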

IV-C Data Augmentations

Our data augmentation attacks proceed as follows. Given a target data point $(x, y)$, we first create $N$ additional data points $x'_1, \dots, x'_N$ via different data augmentation strategies, described in more detail below. We then query the target model $h$ at all these points (including the original point) to obtain labels $y_0, y_1, \dots, y_N$. Let $b_i = \mathbb{1}[y_i \neq y]$ be the indicator function for whether the i-th queried point was misclassified. Finally, we apply a prediction model $f(b_0, \dots, b_N) \in \{0, 1\}$ to decide whether $x$ should be classified as a training set member or not.

To tune the membership classifier $f$, the adversary first locally trains a source (or “shadow”) model $\hat{h}$, assuming knowledge of the target model’s architecture and of the distribution of its training data. As the adversary knows the training data for $\hat{h}$, it can train $f$ to maximize membership inference accuracy for this local model. The adversary then “transfers” $f$ to predict membership of target points using the query responses of the black-box model $h$.
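The query-and-aggregate procedure above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the target model is replaced by a toy one-dimensional threshold classifier, the augmentations are simple shifts, and the membership rule is a fixed count threshold standing in for a classifier that would be trained on the source model.

```python
def misclassification_bits(query_label, x, y, augmentations):
    """Query the model on x and its augmentations; bit i is 1 if point i is misclassified."""
    points = [x] + [aug(x) for aug in augmentations]
    return [int(query_label(p) != y) for p in points]

def membership_rule(bits, max_errors):
    """Stand-in for the membership classifier: 'member' if at most max_errors bits are set."""
    return sum(bits) <= max_errors

# Toy label-only model (assumption): class 1 for positive inputs, class 0 otherwise.
def toy_query_label(x):
    return 1 if x > 0 else 0

# Toy augmentations (assumption): shift the scalar input by +/- 0.5.
augmentations = [lambda x: x + 0.5, lambda x: x - 0.5]

robust_bits = misclassification_bits(toy_query_label, 2.0, 1, augmentations)
fragile_bits = misclassification_bits(toy_query_label, 0.2, 1, augmentations)

robust_is_member = membership_rule(robust_bits, max_errors=0)
fragile_is_member = membership_rule(fragile_bits, max_errors=0)
```

The point at 2.0 keeps its label under every shift and is flagged as a member, while the point at 0.2 flips under one shift and is not. In the attack as described, the rule would be fit on a source model with known membership rather than fixed by hand.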

We experiment with two common data augmentations in the computer vision domain: image rotations and translations.


Rotations

Our rotation augmentation rotates images to within $\pm r$ degrees of the original image. Specifically, given a rotation magnitude $r$, we generate three images, including the source, by rotating the source image by $\pm r$ degrees.

Translations

Our translation attack follows a similar procedure. Given a pixel bound $d$, we translate the image by $\pm i$ pixels horizontally, and by $\pm j$ pixels vertically for $i, j$ satisfying $|i| + |j| = d$. Note that this corresponds to $4d$ translated images in total (plus the original untranslated image).
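As a small worked example of the translation query set, the sketch below enumerates integer shifts under the assumption that the horizontal shift i and vertical shift j satisfy |i| + |j| = d for the pixel bound d, which yields 4d distinct shifts for d >= 1. The `translate` helper and its zero-padding behaviour are illustrative assumptions; the text does not specify how vacated pixels are filled.

```python
def translation_offsets(d):
    """All integer (i, j) shifts with |i| + |j| = d, for a pixel bound d >= 1."""
    offsets = []
    for i in range(-d, d + 1):
        j = d - abs(i)
        offsets.append((i, j))
        if j != 0:
            offsets.append((i, -j))
    return offsets

def translate(image, i, j, fill=0):
    """Shift a 2-D list-of-rows image i pixels right and j pixels down, padding with fill."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + j, c + i
            if 0 <= nr < height and 0 <= nc < width:
                out[nr][nc] = image[r][c]
    return out

image = [[1, 2],
         [3, 4]]  # tiny hypothetical grayscale image
offsets = translation_offsets(1)
queries = [image] + [translate(image, i, j) for i, j in offsets]
```

For d = 1 this produces four shifted copies plus the original, so the adversary issues five label-only queries for this target point.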

In Section VI-B, we explore the effect of picking different query budgets (i.e., the values $r$ and $d$ for rotation and translation augmentations) on the attack strength.

IV-D Decision Boundary Distance

The attacks described in this section aim to predict membership based on a point’s distance to the model’s decision boundary. As we have seen above, for linear models the distance to the boundary captures the same information as the model’s confidence score. The attacks below extend this intuition to deeper neural networks.

Given some estimate $\mathrm{dist}_h(x, y)$ of a point’s distance to the model’s boundary, we predict that $x$ is a training set member if $\mathrm{dist}_h(x, y) > \tau$ for some threshold $\tau$. We define $\mathrm{dist}_h(x, y) = 0$ for misclassified points, where $\arg\max_i h(x)_i \neq y$. To tune the threshold $\tau$, we train a local source model $\hat{h}$, and set $\tau$ so as to maximize membership inference accuracy on $\hat{h}$.
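Tuning the distance threshold on a source model can be sketched as a simple search over candidate values. The snippet below assumes the adversary has already estimated boundary distances for points whose membership in the source model's training set is known; the distances and membership flags are made up for illustration, and the exhaustive search is one straightforward choice rather than the paper's stated procedure.

```python
def predict_member(distance, tau):
    """Decision rule: 'member' when the estimated boundary distance exceeds tau."""
    return distance > tau

def tune_threshold(distances, is_member):
    """Return the (tau, accuracy) pair that maximizes accuracy on the source model's data."""
    best_tau, best_acc = 0.0, -1.0
    for tau in sorted(set(distances) | {0.0}):
        correct = sum(predict_member(d, tau) == m
                      for d, m in zip(distances, is_member))
        acc = correct / len(distances)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc

# Hypothetical distance estimates on the source model; 0.0 marks a misclassified point.
source_distances = [0.0, 0.2, 0.3, 0.9, 1.1, 1.4]
source_is_member = [False, False, False, True, True, True]

tau, source_accuracy = tune_threshold(source_distances, source_is_member)
target_guess = predict_member(1.0, tau)  # apply the tuned tau to a target-model estimate
```

Because misclassified points are assigned distance zero and candidate thresholds are non-negative, this rule never labels a misclassified point as a member, so it refines the gap attack rather than contradicting it.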

A White-Box Baseline

Our first procedure for estimating $\mathrm{dist}_h(x, y)$ is an idealized attack that assumes white-box access to the model, and is therefore not label-only. To estimate a point’s distance to the boundary, we use adversarial examples generated by the Carlini and Wagner attack [6]: given $(x, y)$ the attack tries to find the closest point $x'$ to $x$ in the Euclidean norm, such that $\arg\max_i h(x')_i \neq y$.

Label-Only Attacks

To make the attack label-only, we rely on label-only attacks developed in the adversarial examples literature [4, 8]. Given a point $(x, y)$, these attacks begin by picking a random point $x'$ such that $\arg\max_i h(x')_i \neq y$, and then issue multiple label-only queries to $h$ to find the model’s decision boundary. They then “walk” along the model’s boundary while minimizing the distance to $x$. In our experiments, we use the recent “HopSkipJump” attack [8], which has been shown to closely approximate the distance estimates produced by stronger white-box attacks (e.g., Carlini-Wagner), given a few thousand label-only queries.

Robustness to Random Noise

As label-only adversarial examples attacks such as HopSkipJump require a large number of queries, we also explore a much simpler approach based on random perturbations. Again, our intuition stems from linear models: a point’s distance to the boundary is directly related to the model’s accuracy when the point is perturbed by isotropic Gaussian noise [17]. The attack we propose presumes that this relationship also holds for deeper models. We compute a crude proxy for by evaluating the accuracy of on points of the form . We tune the standard deviation , as well as the membership threshold , on the adversary’s local source model .
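The random-noise proxy can be sketched in a few lines. This is a minimal illustration, assuming the queries are the original point plus independent isotropic Gaussian noise with standard deviation `sigma`, and that the model is exposed through a hypothetical `predict_label` callback that returns only a label.

```python
import numpy as np


def noise_robustness(predict_label, x, y, sigma, n_queries, rng):
    """Fraction of Gaussian-perturbed copies of x that keep the label y."""
    x = np.asarray(x, dtype=float)
    # One isotropic Gaussian noise sample per query, same shape as x.
    noise = rng.normal(0.0, sigma, size=(n_queries,) + x.shape)
    hits = sum(predict_label(x + eps) == y for eps in noise)
    return hits / n_queries
```

The returned fraction plays the role of the distance estimate: it is compared against a membership threshold, with both the noise level and the threshold chosen on the source model.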

V Evaluation Setup

Our evaluation is aimed at understanding how label-only membership inference attacks compare with prior attacks that rely on access to a richer query interface. To this end, we aim to answer the following questions:

  1. Can label-only membership inference attacks match (or even outperform) prior attacks that make use of the model’s (full) confidence vector?

  2. Under what settings do different label-only attacks perform best?

  3. Are there settings in which label-only attacks can improve upon prior attacks?

  4. What defenses are successful against all attacks, whether label-only or with access to full confidence vectors?

V-A On Measuring Success

To evaluate an attack’s success, we pick a balanced set of points from a task distribution, of which half come from the target model’s training set. The adversary predicts whether each point was in the training set or not. We measure attack success as overall membership prediction accuracy but find F1 scores to approximately match, with near 100% recall. (Footnote 4: Some recent works have questioned the use of (balanced) accuracy as a measure of attack success and proposed other measures more suited for imbalanced priors, where any data point targeted by the adversary is a priori unlikely to be a training point [22]. As our main goal is to study the effect of the model’s query interface on the ability to perform membership inference, we focus here on the same balanced setting considered in most prior work. We also note that the assumption that the adversary has a (near-) balanced prior need not be unrealistic in practice: for example, the adversary might have query access to models from two different medical studies (trained on patients with two different conditions) and might know a priori that some targeted user participated in one of these studies, without knowing which.)
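For reference, the success measures mentioned above can be computed from boolean membership predictions as in the sketch below. The function name and the dictionary layout are illustrative choices, not part of the original evaluation code.

```python
def attack_metrics(predicted_member, is_member):
    """Accuracy, precision, recall and F1 for boolean membership predictions."""
    pairs = list(zip(predicted_member, is_member))
    tp = sum(1 for p, m in pairs if p and m)          # members flagged as members
    fp = sum(1 for p, m in pairs if p and not m)      # non-members flagged as members
    fn = sum(1 for p, m in pairs if not p and m)      # members missed
    tn = sum(1 for p, m in pairs if not p and not m)  # non-members correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}
```

On a balanced evaluation set, accuracy of 0.5 corresponds to random guessing, which is the reference point for all attack accuracies reported here.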

Overall, we stress that the main goal of our evaluation, and of our paper, is to show that in settings where membership inference attacks have been shown to succeed, a label-only query interface is sufficient. In general, we should not expect our label-only attacks to exceed the performance of prior membership inference attacks since the former use strictly less information from queries than the latter. As we will see in Sections VII and VIII, two notable exceptions to this are defenses that use “confidence masking” and models trained with significant data augmentations. In both cases, we find that existing attacks severely underestimate membership leakage.

V-B Attack Setup

We evaluate our label-only membership inference attacks on a variety of models trained on standard computer vision tasks, i.e., CIFAR-10, CIFAR-100 [25], and MNIST [26]. Our focus on vision datasets is mainly due to the important role of data augmentations in the common computer vision pipeline, and to compare directly with prior works that evaluated on similar datasets. We note that the principles behind our attacks carry over to other domains as well.

For each task, we train target neural networks on subsets of the original training data. Controlling the size of the target model’s training set lets us control the amount of overfitting, which strongly influences the strength of membership inference attacks [58]. Prior works have shown that (confidence-based) membership inference attacks mainly succeed in settings where models exhibit a high degree of overfitting, so we evaluate our label-only attacks in similar settings. We use two representative model architectures, a standard convolutional neural network (CNN) and a ResNet. Our CNN has four convolution layers with ReLU activations. The first two convolutions have filters and the second two have 64 filters, with a max-pool in between the two. To compute logits we feed the output through a fully-connected layer with neurons. This model has million parameters. Our ResNet-28 is a standard Wide ResNet-28 taken directly from [46] with million parameters. All models are trained for 20 to 1000 training epochs, with early stopping when the training loss fails to decrease by from one epoch to the next.

To tune the attack, the adversary trains a source (or shadow) model using an independent, non-overlapping subset of the tasks’ original training dataset. For the attacks from prior work based on confidence vectors, and our new label-only attacks based on data augmentations, we use shallow neural networks as membership predictor models . Specifically, for augmentations, we use two layers of 10 neurons and LeakyReLU activations [30]. The confidence-vector attack models use a single hidden layer of 64 neurons, as originally proposed by Shokri et al. [44]. We train a separate prediction model for each class. We observe minimal changes in attack performance by changing the architecture, or by replacing the predictor model by a simple thresholding rule. For simplicity, our decision boundary distance attacks use a single global thresholding rule.
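The simple thresholding alternative to a learned predictor can be illustrated as follows. The sketch assumes that the label-only responses for one point are encoded as one bit per augmented query (1 if the model still outputs the true label), and that membership is predicted when enough bits are set. Both the encoding and the names `augmentation_features` and `threshold_rule` are assumptions made for illustration.

```python
import numpy as np


def augmentation_features(predict_label, augmented_copies, y):
    """Binary vector: 1 where the model labels an augmented copy of the point as y."""
    return np.array([int(predict_label(a) == y) for a in augmented_copies])


def threshold_rule(features, min_correct):
    """Predict 'member' when at least min_correct augmented queries are labeled correctly."""
    return int(features.sum()) >= min_correct
```

A shallow neural network predictor would take the same binary vector as input. The thresholding rule simply replaces it with a count, with `min_correct` tuned on the source model.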

To account for randomness in our setup (e.g., sub-sampling of the data, model training, etc.), we repeat each individual experiment at least times and report the mean and standard deviation when appropriate.

VI Evaluation of Label-Only Attacks

VI-A Label-Only Attacks Match Confidence-Vector Attacks

We first focus on question 1), understanding how well our label-only attacks compare with the canonical confidence-vector attacks of Shokri et al. [44]. Recall from Section IV-A that any label-only attack (with knowledge of a target’s true label) is always trivially lower-bounded by the baseline gap attack of Yeom et al. [58], that simply predicts that a point is a non-member of the training set if it is misclassified.
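The baseline gap attack and its expected accuracy on a balanced evaluation set can be written down directly. In the sketch below the function names are hypothetical. The accuracy formula follows from the rule itself: members are identified correctly whenever they are classified correctly (probability equal to the training accuracy), and non-members whenever they are misclassified (probability one minus the test accuracy).

```python
def gap_attack(predicted_labels, true_labels):
    """Predict 'member' exactly when the model classifies the point correctly."""
    return [p == t for p, t in zip(predicted_labels, true_labels)]


def expected_gap_attack_accuracy(train_acc, test_acc):
    """Balanced-set accuracy: 0.5 * train_acc + 0.5 * (1 - test_acc)."""
    return 0.5 + (train_acc - test_acc) / 2.0
```

For example, a model with 100% training accuracy and 60% test accuracy gives the gap attack an expected accuracy of 70%, so any attack worth considering has to beat that figure.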

Our main result is that our label-only attacks consistently outperform the baseline gap attack, and perform on par with prior confidence-vector attacks.

Figure 1 plots the accuracy of membership inference attacks on CIFAR-10, for models trained on up to data points. The confidence-vector attack consistently outperforms the baseline gap attack, demonstrating that it exploits non-trivial membership leakage from the model’s query responses. Remarkably, we find that our label-only boundary distance attack (based on the HopSkipJump attack [8] for finding adversarial examples) performs on par with, or slightly better than, the confidence-vector attacks, despite having access to a more restricted query interface. Moreover, the simpler (and more query efficient, see Section VI-B below) label-only data augmentation attacks also consistently outperform the baseline, but fall short of the full confidence-vector attacks. The models in this evaluation did not use data augmentations during training; in Section VIII we find that when they do, our data augmentation attacks outperform all others.

Finally, we verify that as the training set size increases, the performance of the baseline attack, as well as of all the other attacks, monotonically decreases since the model’s generalization gap is reduced.

Table II (a) reports similar results for the CIFAR-100 dataset and (c) for the MNIST dataset. Due to the larger size of CIFAR-100, we provide results for a single model trained on a subset of data points, which is the largest dataset size we can experiment with since we keep half of the total dataset for training the adversary’s local source model. Mirroring the results on CIFAR-10, we find that the confidence-vector attack outperforms the gap attack, but that its performance can be matched by our best label-only attacks.

Fig. 1: Accuracy of membership inference attacks on CIFAR-10. We compare the baseline gap attack, the confidence-vector attack that relies on a fine-grained query interface, and our label-only attacks based on data augmentation and distance to the decision boundary. For the data augmentation attack, we report the best accuracy across multiple values of (rotation angle) and (number of translated pixels).

VI-B The Query Complexity of Label-Only Attacks

(a) Rotation attack
(b) Translation attack
(c) Boundary distance attack
Fig. 2: Comparison of different label-only membership inference attacks on CIFAR-10. The target model is trained on a subset of data points. In (a) and (b), we compare the performance of the label-only data augmentation attack against two baselines (the naive gap attack, and the confidence-vector attack), as we increase the size of pixel shifts . In (c), we compare attacks that threshold on an estimate of a point’s distance to the boundary. The white-box attack is an idealized baseline that uses Carlini and Wagner’s attack [6]. The label-only attack uses HopSkipJump [8] with various query budgets, and the random noise attack queries the model on a varied number of randomly perturbed samples.

We now answer question 2) of our evaluation: in what regimes do different label-only attacks perform best?

Figure 1 shows that the decision-boundary distance attack performs significantly better than our label-only attacks based on data augmentations. Yet, the decision-boundary attack also requires a large number of queries to the target model (on the order of ), while the data augmentation attacks only make a small number of queries (between and ). We now investigate how the success rate of different label-only attacks is influenced by the attack’s query budget.

Recall that our rotation and translation attacks are parametrized by and , respectively, which control the number of augmented images (queries) that our attacks evaluate (namely for rotations, and for translations). Figure 2 (a)-(b) shows how the attack success rate is influenced by these parameters. For both the rotation and translation attack, we find that there is a specific range of perturbation magnitudes for which the attack exceeds the baseline (i.e., for rotations, and for translations). When the augmentations are too small or too large, the attack performs poorly because the augmentations have a similar effect on both train and test samples (i.e., small augmentations rarely change model predictions, whereas large augmentations often cause misclassifications, for train and test samples alike). For both attacks, an optimal parameter choice outperforms the baseline by - percentage-points. Note that an adversary can tune the best values of and using its local source model. As we will see in Section VIII, these attacks perform significantly better for models that used data augmentation at training time.

In Figure 2 (c), we compare different attacks that approximate the model’s robustness to small perturbations (in the L2-norm)—to obtain a proxy for prediction confidence. As an idealized baseline, we use the adversarial examples attack of Carlini and Wagner [6] which assumes white-box access to the target model. Though not label-only, its success rate serves as a reasonable upper-bound for the amount of membership leakage that can be extracted from the boundary distances. We compare this upper bound to a label-only attack using HopSkipJump [8]. This attack has two parameters governing its query complexity: the number of iterations and the number of search steps per iteration. By varying these parameters, we explore the tradeoff between query complexity and attack accuracy. As we can see, the attack matches the upper-bound given by Carlini-Wagner with about queries. In this setting, it also matches the best confidence-vector attack (see Figure 1). Even in low query regimes (), the attack outperforms the trivial gap attack by .

The final attack we evaluate is a label-only attack that measures the model’s accuracy under random perturbations. Here, our queries to the target model are of the form . The noise magnitude is tuned to maximize the attack success rate on the adversary’s local source model. Surprisingly this simple attack performs very well in low query regimes. For a query budget , it outperforms the HopSkipJump-based attack and typically outperforms the data augmentation attacks at a given query budget as well. For large query budgets, the HopSkipJump attack produces more precise distance estimates and outperforms the random attack.

VII Breaking Confidence-Masking Defenses

In this section, we answer question 3) and showcase an example where our label-only attacks outperform prior attacks by a significant margin, despite the strictly more restricted query interface that they assume. We evaluate a number of defenses against membership inference attacks and show that while these defenses do protect against existing confidence-vector attacks, they have little to no effect on our label-only attacks.

We identify a common pattern to these defenses that we call confidence masking. Confidence masking defenses aim to prevent membership inference by directly minimizing the information leakage in a model’s confidence scores. Towards this goal, a defense that relies on confidence masking explicitly or implicitly masks (or, obfuscates) the information contained in the confidence scores returned by the model, so as to thwart existing attacks. We focus our analysis on two defenses in this category: MemGuard [23] and adversarial regularization [33]. However, previously proposed defenses such as reducing the precision or number of returned values of the confidence-vector [44, 54, 41] and recent defenses such as prediction purification [57] also rely on this mechanism.

Confidence masking thwarts existing attacks (e.g., by adding noise to the vector) whilst having a minimal effect on the model’s predicted labels. MemGuard [23] and prediction purification [57] explicitly maintain the invariant that the model’s predicted labels are not affected by the defense, i.e.,

$$\arg\max h^{\text{defense}}(x) = \arg\max h(x) \quad \text{for all } x,$$

where $h^{\text{defense}}$ is the defended version of the model $h$. In adversarial regularization [33], instead of explicitly enforcing this constraint at test time, the model is trained to achieve high accuracy whilst simultaneously minimizing the information available in the confidence scores.

There is an immediate issue with the design of these confidence-masking defenses: by construction they will prevent neither the gap attack nor our stronger label-only attacks. Yet, these defenses were reported to drive the success rates of existing membership inference attacks to close to chance. This result suggests that prior attacks fail to properly extract membership leakage information contained in the model’s predicted labels, and indeed, implicitly contained within its scores. At the same time, our results with label-only attacks clearly indicate that confidence masking is not a viable defense strategy against membership inference.

In the following sections, we show that both MemGuard [23] (CCS’19) and adversarial regularization [33] (CCS’18) fail to prevent the naive gap attack as well as our more elaborate label-only attacks. In both cases, we show that the defense does not significantly reduce membership leakage, compared to an undefended model.

VII-A Breaking MemGuard

We implement the MemGuard algorithm for defending against membership inference. This defense solves a constrained optimization problem to compute a defended confidence-vector , where is an adversarial noise vector that satisfies the following constraints: (1) the model still outputs a vector of “probabilities”, i.e., and ; (2) the model’s predictions are unchanged, i.e., ; and (3) the noisy confidence vector “fools” existing membership inference attacks. To enforce the third constraint, the defender locally creates a membership attack predictor , and then optimizes the noise to cause to mis-predict membership. We consider the strongest version of the defense in [23], that is allowed to make arbitrary changes to the confidence vector (i.e., ) under the constraint that the model’s predicted label is unchanged.

Note that the second constraint guarantees that the defended model’s train-test gap remains unaltered, and the defense thus has no effect on the baseline gap attack. Worse, by construction, this defense cannot prevent any label-only attacks because it preserves the output label of the model on all inputs.
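The effect of these constraints can be seen with a toy post-processor. The sketch below is not MemGuard's optimization procedure. It is a deliberately crude stand-in that replaces a confidence vector with random scores while enforcing only constraints (1) and (2), i.e., the output is still a probability vector and its top label is unchanged. The function name is hypothetical.

```python
import numpy as np


def mask_confidences(probs, rng):
    """Replace a confidence vector with random scores that keep the same top label."""
    probs = np.asarray(probs, dtype=float)
    label = int(np.argmax(probs))
    # Random point on the probability simplex: non-negative entries summing to 1.
    masked = rng.dirichlet(np.ones(len(probs)))
    top = int(np.argmax(masked))
    # Swap the largest random score into the originally predicted class.
    masked[label], masked[top] = masked[top], masked[label]
    return masked
```

Because the predicted label survives any such post-processing, every label-only query (and therefore the gap attack and the attacks built on augmentations or boundary distance) returns exactly what it would have returned without the defense.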

The main reason that this defense was found to protect against confidence-vector attacks in [23] is that those attacks were not properly adapted to the defense. Specifically, MemGuard is evaluated against confidence-vector attacks that are tuned on source models without MemGuard enabled. As a result, these attacks’ membership predictors are tuned to distinguish members from non-members based on high confidence scores, which MemGuard obfuscates. In a sense, a label-only attack like ours is the “right” adaptive attack against MemGuard: since the model’s confidence scores are no longer reliable, the adversary’s best strategy is to extract membership information from hard labels, which the defense explicitly does not modify. Moving forward, we recommend that the trivial gap baseline serve as an indicator of this form of confidence masking: a non-adaptive confidence-vector attack should not perform significantly worse than the trivial gap baseline in order for a defense to protect against membership leakage.

From Figure 3, we observe that MemGuard, as expected, offers no protection against our label-only attacks. All our attacks significantly outperform the canonical (non-adaptive) confidence-vector attack, as well as the baseline attack, across all subset sizes that we evaluate. Thus, for a defense to protect against all forms of membership inference attacks, it cannot solely post-process the confidence-vector—doing so will still leave the model vulnerable to label-only attacks. In Table II (b) and (d), we report similar results on CIFAR-100 and MNIST, respectively: while the defense breaks prior confidence-based attacks, it has no effect on the generalization gap, or on our stronger label-only attacks.

Fig. 3: Accuracy of membership inference attacks on CIFAR-10 models protected with MemGuard [23]. The fact that (non-adaptive) confidence-vector attacks perform much worse even than the trivial train-test gap baseline illustrates that these attacks are inappropriate for assessing the robustness of membership inference defenses. The high success of label-only attacks shows that MemGuard performs “confidence masking” and does not address the model’s actual membership leakage.

Prediction purification [57] is a similar defense. It trains a purifier model, , that is applied to the output vector of the target model. That is, on a query , the adversary receives . The purifier model is trained so as to minimize the information content in the confidence vector, whilst preserving model accuracy. While the defense does not guarantee that the model’s labels are preserved at all points, the defense is by design incapable of preventing the baseline gap attack, and it is likely that our stronger label-only attacks would similarly be unaffected (intuitively, is just another deterministic classifier, so the membership leakage from a point’s distance to the decision boundary should not be expected to change).

In a similar vein, many simple defenses proposed in prior work can be broken by label-only attacks. These include any type of static defense that reduces information in confidence scores, such as returning only the top-k confidence scores, rounding the confidences to a lower precision, or various ways of noising the confidences [44].

VII-B Breaking Adversarial Regularization

Fig. 4: Accuracy of membership inference attacks on CIFAR-10 models protected with Adversarial Regularization [33]. This defense strategy does not explicitly aim to reduce the train-test gap and thus does not protect against label-only attacks. However, we find that this defense can prevent attacks from exploiting much leakage beyond the gap.

Adversarial regularization [33] differs from MemGuard and prediction purification, in that it does not simply obfuscate confidence vectors at test time. Rather, it jointly trains a target model and an attacker model in a min-max fashion. In alternating training steps, the attacker model is trained to maximize membership inference from the target model’s outputs, and the target model is trained to produce outputs that are accurate yet fool the attacker.

We train a target model defended using adversarial regularization. We use a confidence-vector membership classifier as our defensive classifier. This defense’s training has two additional hyper-parameters: controls the ratio of maximization to minimization steps during training, and is the regularization constant that balances the target model’s two objectives, i.e., low training error and low membership leakage. We test several values of and find that setting enabled the target model to converge to a defended state. For the regularization term , we try different values and report all results in Figure 4. As we can see, an optimal choice of can reduce the confidence-vector attack’s success to within percentage-points of random guessing. However, the attack is outperformed by our label-only attacks. The defense has only a moderate effect on the model’s train-test gap, and thus the accuracy of the trivial baseline attack is not reduced. We find that our more complex label-only attacks do not significantly outperform this baseline for most choices of regularization term , which is consistent with the effects we observe for more common regularization techniques such as L2 weight decay (see Section VIII). Thus, this defense is not entirely ineffective—it does prevent attacks from exploiting much more leakage than the trivial gap attack. And yet, evaluating the defense solely on (non-adaptive) confidence-vector attacks leads to an overestimate of the achieved privacy.

TABLE II: Accuracy of membership inference attacks on CIFAR-100 and MNIST. Panels: (a) CIFAR-100 Undefended, (b) CIFAR-100 MemGuard, (c) MNIST Undefended, (d) MNIST MemGuard. Each panel reports attack accuracy for the gap attack, the data augmentation attack, and the boundary distance attack. The target models are trained using data points for CIFAR-100 and for MNIST. Tables (a) and (c) report results without any defense, and (b) and (d) report results with MemGuard [23], which prevents the confidence-vector attacks via “confidence-masking”.

VIII Defending with Better Generalization

Following our findings that confidence-masking defenses cannot robustly defend against membership inference attacks, we now answer question 4). We investigate to what extent we can defend against membership inference attacks via standard regularization techniques—the aim of which is to limit the model’s ability to overfit to the training set. This form of regularization was introduced by the ML community to encourage generalization. In this section, we study the impact of the following common regularization techniques on membership inference: data augmentation, transfer learning, dropout, L1/L2 regularization, and differential privacy.

The case of data augmentation is of particular interest: on the one hand, the regularization effect of data augmentation is expected to reduce membership leakage. On the other, some of our attacks directly exploit the model’s overfitting on augmented data.

We explore three questions in this section:

  1. How does training with data augmentation impact membership inference attacks, especially the ones that query the model on augmented data?

  2. How well do other traditional regularization techniques from the machine learning literature help in reducing membership leakage?

  3. How do these defenses compare to differential privacy, which can provide formal guarantees against any form of membership leakage?

VIII-A Training with Data Augmentation Exacerbates Membership Leakage

Fig. 5: Accuracy of membership inference attacks on CIFAR-10 models trained with data augmentation. Target models are trained on a subset of 2500 images. The parameter controls, as in our attack, the number of pixels by which images are translated during training. Training without data augmentation corresponds to . For models trained with significant amounts of data augmentation, membership inference attacks become stronger despite the model generalizing better. Moreover, label-only attacks based on data augmentation perform as well as or better than other membership inference attacks.

Data augmentation is commonly used in machine learning to prevent a model from overfitting, in particular in low data regimes [45, 31, 52, 15, 16]. Data augmentation is used to increase the diversity of a model’s finite training set, by efficiently synthesizing new data via natural transformations of existing data points that preserve class semantics (e.g., small rotations or translations of images).

Data augmentation presents an interesting case study for our label-only membership inference attacks. As it reduces the model’s overfitting, one would expect data augmentation to reduce membership leakage. At the same time, a model trained on augmented data will have been trained to strongly recognize not only the original data point, but also a number of augmented versions of it, which is precisely the signal that our label-only attacks based on data augmentations exploit.

We train target models by incorporating data augmentations similar to those described in Section IV-C. We focus here on image translations, as these are most routinely used to train computer vision models. In each training epoch, the model is evaluated on all translations of an image, parametrized by the amount of shifted pixels . This simple pipeline differs slightly from the standard data augmentation pipeline, which samples an augmentation at random for each training batch. We opted for this approach to illustrate the maximum leakage incurred when the adversary’s attack queries exactly match the samples seen during training. Later in this section, we will evaluate a robust pipeline taken directly from FixMatch [46] and show that our results from this simple pipeline transfer.

We plot the success of various membership inference attacks on models trained with data augmentation in Figure 5. First, we observe the effect of augmentations on overfitting: as the model is trained on larger image translations (by up to pixels), the model’s train-test gap decreases. Specifically, the model’s test accuracy grows from without translations to with , corroborating the benefits of data augmentation for improving generalization.

Yet, we find that as the model is trained on more data augmentations: (1) the accuracies of the confidence-vector and boundary distance attacks decrease; and (2) the success rate of the data augmentation attack increases.

Regarding the decrease in accuracy of the confidence-vector and decision boundary attacks, this should be expected given the model’s improved generalization. The increase in performance of the data augmentation attacks confirms our initial intuition that the model now leaks additional membership information via its invariance to the training-time augmentations. Note that the label-only attack on a model trained with pixel shifts exceeds the accuracy of the confidence-vector attack on the original non-augmented model, despite the model with having a higher test accuracy. This result illustrates that a model’s ability to generalize is not the only variable affecting its membership leakage: models that overfit less on the original training dataset may actually be more vulnerable to membership inference because they implicitly overfit more on a related training set.

In Figure 11 in the Appendix, we investigate how the attack is impacted when the attacker’s choice of the parameter does not match the value used during training. Unsurprisingly, we find that the attack is strongest when the attacker’s guess for is correct, and that it degrades by - percentage points as the difference in magnitude between train and test augmentations grows. We note that data augmentation values for a specific domain and image resolution are often fixed, so adversarial knowledge of the model’s data augmentation pipeline is not a strong assumption.

(a) Without Augmentations
(b) With Augmentations
Fig. 6: Accuracy of membership inference attacks on CIFAR-10 models trained as in FixMatch [46]. Our data augmentation attacks, which mimic the training augmentations, match or outperform the confidence-vector attacks when augmentations were used in training. As in previous experiments, we find our label-only distance attack performs on par with the confidence-vector attack. “With Augmentations” and “Without Augmentations” refer to using all regularizations, as in FixMatch, or none, respectively.

Following our study of a simple data augmentation scheme, we now aim to understand to what extent membership inference attacks apply to a state-of-the-art neural network and data processing pipeline. We use, without modification, the pipeline from FixMatch [46], which trains a ResNet-28 to accuracy on the CIFAR-10 dataset, comparable to the state of the art. As with our other experiments, this model is trained using a subset of CIFAR-10, which sometimes leads to observably overfit models, as indicated by a higher gap attack accuracy. We train models using the following four regularizations, either all enabled (“With Augmentations”) or all disabled (“Without Augmentations”):

  1. random image flips across the horizontal axis,

  2. random image shifts by up to pixels in each direction,

  3. random image cutout [13],

  4. weight decay of magnitude .

Our augmentation attacks are tuned to mimic the training pipeline, since this is the case where our attacks perform best. We evaluate randomly generated augmentations for this attack. These results are reported in Figure 6. Similar to our simple pipeline, we find that the use of augmentations in training consistently improves generalization accuracy. Interestingly, the gap attack accuracy also improves due to a relatively larger increase in training accuracy. Similar to our simple pipeline, the confidence-vector attack accuracy is degraded when training with augmentations, but our augmentation attack can now perform on par with (and, in some cases, better than) the confidence-vector attack.

An interesting question stemming from our experiments, which we leave for future work, is to understand how much membership leakage can be exploited by querying the target model on augmented data in a setting where the attacker does receive full confidence vectors. As such an adversary receives strictly more information than the label-only attacks we consider here, we expect such an attack to do at least as well as the best attack in Figure 5, and potentially even better (although all our experiments in this paper do suggest that full confidence vectors provide little additional membership leakage compared to hard labels).

VIII-B Other Techniques to Prevent Overfitting

As we have seen, data augmentation does not necessarily prevent membership leakage, despite its positive regularization effect. We now explore questions 2) and 3) and turn to other standard machine learning techniques aimed at preventing overfitting: dropout [48], weight decay (L1/L2 regularization), transfer learning, and differentially private training [1].

Dropout and weight decay are straightforward to add to any neural network. We provide more detail in Appendix -A.

Transfer learning improves the generalization of models trained on small datasets. A model is first trained on a larger dataset from a related task, and this model is then fine-tuned to the specific low-data task. To fine-tune the pre-trained model, we remove its last layer (so that the model acts as a feature extractor), and train a new linear classifier on top of these features. We call this approach last layer fine-tuning. An alternative is to fine-tune the feature extractor together with the final linear layer, i.e., full fine-tuning.
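Last layer fine-tuning can be sketched as follows. This is a minimal illustration, assuming a single ReLU layer with random weights as a stand-in for the pre-trained feature extractor and a toy two-class dataset; the names `extract_features` and `fine_tune_last_layer` and all hyperparameters are hypothetical. Only the new linear classifier is updated, while the feature extractor stays frozen.

```python
import numpy as np

def extract_features(x, w_feat):
    """Frozen feature extractor: one ReLU layer standing in for a pre-trained network."""
    return np.maximum(x @ w_feat, 0.0)

def fine_tune_last_layer(x, y, w_feat, n_classes, lr=0.1, steps=100):
    """Train only a new linear (softmax) classifier on top of the frozen features."""
    z = extract_features(x, w_feat)                    # w_feat is never updated
    w_out = np.zeros((z.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = z @ w_out
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w_out -= lr * z.T @ (probs - onehot) / len(x)  # cross-entropy gradient step
    return w_out

# Toy data: two well-separated classes (hypothetical, for illustration only).
rng = np.random.default_rng(0)
w_feat = rng.normal(size=(20, 16))                     # pretend these came from pre-training
direction = np.ones(20) / np.sqrt(20)
y = rng.integers(0, 2, size=200)
x = np.where(y[:, None] == 0, 2.0, -2.0) * direction + 0.05 * rng.normal(size=(200, 20))

w_feat_before = w_feat.copy()
w_out = fine_tune_last_layer(x, y, w_feat, n_classes=2)
train_acc = np.mean((extract_features(x, w_feat) @ w_out).argmax(axis=1) == y)
```

Full fine-tuning would additionally propagate gradients into `w_feat`, which gives the model more capacity to fit (and potentially memorize) the small target dataset.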

Fig. 7: Accuracy of membership inference attacks on CIFAR-10 models trained with transfer learning. The source model for transfer learning is trained on all of CIFAR-100. Models are tuned on subsets of CIFAR-10.
Fig. 8: Comparison of confidence-vector and label-only attacks on models with varied defenses: (a) label-only attacks, (b) full confidence-vector attacks. Target models are trained on 2500 data points from CIFAR-10. Point sizes indicate relative regularization amounts within a defense.
Fig. 9: Test accuracy and label-only attack accuracy for models with varied defenses. Target models are trained on 2500 data points from CIFAR-10. Models towards the bottom right are more private and more accurate. Point sizes indicate relative regularization amounts within a defense.

We pre-train a model on CIFAR-100 to a test accuracy of . We then use either full fine-tuning or last layer fine-tuning on a subset of CIFAR-10. The results of various membership inference attacks are in Figure 7. We compare the gap attack to the best label-only attack, noting that the best label-only attack performed on par with the confidence-vector attack in all cases. We observe that transfer learning indeed reduces the generalization gap, especially when only the last layer is tuned (this is intuitive, as linear layers have less capacity to overfit than full neural networks). With full fine-tuning, the model still leaks additional membership information, so this approach is not an effective defense. Tuning just the last layer, in contrast, reduces all attacks to the baseline gap attack, which performs only marginally better than chance. This privacy gain comes at a price: full fine-tuning can achieve better test accuracies, as shown in Figure 9.

Finally, differentially private (DP) training [1] enforces, in a formal sense, that the trained model does not strongly depend on any individual training point. In other words, it does not overfit. In this paper, we use DP-SGD [1], a differentially private gradient descent algorithm (see Appendix -A for details). We find that when differentially private models are trained to a test accuracy comparable to that of undefended models, the formal privacy guarantees become mostly meaningless (i.e., ).
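As a rough illustration of how DP-SGD limits the influence of any single training point, the sketch below performs one update for logistic regression: each per-example gradient is clipped to a fixed norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged over the batch. The function name `dp_sgd_step` and the default hyperparameters are assumptions for illustration, and privacy accounting (computing the resulting privacy budget) is omitted.

```python
import numpy as np

def dp_sgd_step(w, x, y, rng, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step for logistic regression on a mini-batch (x, y), y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    per_example_grads = (p - y)[:, None] * x                 # shape (batch, dim)
    # Clip each example's gradient so that its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(x)
    return w - lr * noisy_mean
```

Because every clipped gradient has norm at most `clip_norm`, no single example can move the weights by more than `lr * clip_norm / batch_size` before noise is added, which is the intuition behind the claim that the model does not strongly depend on any individual point.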

We evaluate membership inference attacks against models trained with a wide variety of different defensive techniques in Figure 8. We find that most forms of regularization do not reduce the train-test gap below percentage points, and fail to prevent even the baseline gap attack from reaching accuracy or more. The only two forms of regularization that consistently succeed in reducing membership leakage are strong forms of L2 regularization () and training with differential privacy. In order to better understand the tradeoff between privacy and utility, Figure 9 displays the relationship between each model’s test accuracy and vulnerability to membership inference. As we can see, the models trained with differential privacy and strong L2 regularization prevent membership inference at a high cost in generalization ability. Thus these high levels of regularization are actually causing the model to underfit. The plot also clearly indicates the privacy benefits of transfer learning: among models with a similar level of privacy leakage, these models achieve consistently better generalization, as they benefit from the features learned from non-private data. Combining transfer learning and differentially private training can further mitigate privacy leakage at nearly no cost in generalization, yielding models with the best tradeoff. When transfer learning is not an option, dropout appears to perform better.

Figure 8 again illustrates the shortcomings of confidence-masking defenses such as MemGuard and Adversarial Regularization: instead of reducing a model’s train-test gap, they obfuscate model outputs so that existing attacks perform worse than the trivial baseline attack. Our label-only attacks bypass these obfuscation attempts and break the defenses.
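The reason label-only attacks are unaffected by confidence masking can be seen in a toy example. The function below is a deliberately simple stand-in for a confidence-masking defense (it is not MemGuard or Adversarial Regularization): it blends each confidence vector toward the uniform distribution, which lowers the reported confidence but leaves the predicted label untouched. The name `mask_confidences` and the mixing weight are hypothetical.

```python
import numpy as np

def mask_confidences(probs, mix=0.9):
    """Toy confidence masking: blend each confidence vector toward uniform (mix < 1)."""
    probs = np.asarray(probs, dtype=float)
    uniform = np.full_like(probs, 1.0 / probs.shape[1])
    return (1.0 - mix) * probs + mix * uniform
```

Since the transformation is increasing in each coordinate, the arg-max (and hence every label-only query result) is identical before and after masking, while attacks that threshold the confidence values see a very different signal.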

IX Worst-Case Membership Inference

In line with prior work, our experiments show that the best average-case membership inference attacks often extract more membership leakage than the trivial train-test gap baseline, but only by a moderate amount. Moreover, whenever prior attacks do succeed in extracting additional membership leakage, we find that the same can be achieved by an adversary with label-only query access to the model.

We thus now turn to the study of membership inference in the worst case, i.e., inferring membership only for a small set of “outlier” users. Intuitively, even if a model generalizes well on average over the data distribution, it might still have overfit to unusual data points in the tails of the distribution [5]. The study of membership inference for outliers was initiated by Long et al. [29]. We follow a process similar to theirs to identify potential outlier data, as described below.

First, the adversary uses a local source model to map each targeted data point to the model’s feature space. That is, for each input, we extract the activations in the penultimate layer of the source model and treat them as the point’s features. We define two points as neighbors if their features are close, i.e., if the standard cosine distance between them is within a tunable threshold. We define an outlier as a point with fewer than a tunable number of neighbors in the source model’s feature space. Given a dataset of potential targets and an intended fraction of outliers, we tune the distance threshold and the neighbor count so that this fraction of points is defined as outliers.

Fig. 10: Outlier membership inference attacks on defended models. Target and source models are trained on a subset of 2500 points from CIFAR-10. outliers are identified with less than neighbors. We show precision-improvement from the undefended model, using our best label-only membership inference attack.

The adversary then only runs the membership inference attack for the selected outliers. We define the adversary’s success as its precision in inferring membership of outliers in the targeted model’s training set.
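The outlier selection and the precision metric can be sketched as follows, assuming the penultimate-layer features have already been extracted into a matrix with one row per targeted point. The function names (`find_outliers`, `precision`) and parameter names (`max_dist`, `max_neighbors`) are hypothetical, and treating points at exactly the threshold distance as neighbors is an arbitrary choice.

```python
import numpy as np

def find_outliers(features, max_dist, max_neighbors):
    """Indices of points with fewer than max_neighbors neighbors within cosine distance max_dist."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    np.fill_diagonal(cos_dist, np.inf)             # a point is not its own neighbor
    n_neighbors = (cos_dist <= max_dist).sum(axis=1)
    return np.flatnonzero(n_neighbors < max_neighbors)

def precision(predicted_member, is_member):
    """Fraction of points predicted as members that truly are training members."""
    predicted_member = np.asarray(predicted_member, dtype=bool)
    is_member = np.asarray(is_member, dtype=bool)
    n_predicted = predicted_member.sum()
    return float((predicted_member & is_member).sum() / n_predicted) if n_predicted else 0.0
```

In practice, the two thresholds would be swept until the returned index set contains the intended fraction of the dataset, and the membership inference attack would then be run only on those indices.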

We run our membership inference attacks on outliers for the same models that we evaluated in Figure 8. The results are in Figure 10. For different types of regularization schemes, we display the improvement in the attacker’s precision when targeting solely outliers, compared to targeting the entire population. We find that we can always improve the attack by focusing only on outliers, but that strong regularization (e.g., L2 weight decay with a large coefficient, or differential privacy) prevents membership inference even for outliers. As in the average case, we find that the best label-only attacks perform on par with prior confidence-vector attacks, so we simply report the best overall attack in Figure 10.

X Conclusion

We have developed three new label-only membership inference attacks. Their label-only nature requires fundamentally different attack strategies that, in turn, cannot be trivially prevented by obfuscating a model’s confidence scores. We have used these attacks to break two state-of-the-art defenses against membership inference attacks.

We have found that the problem with these defenses runs deeper, in that they cannot meaningfully prevent a trivial attack that predicts a point as a training member if it is classified correctly. As a result, any defenses against membership inference necessarily have to help reduce a model’s train-test gap.
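The link between this trivial attack and the train-test gap can be made explicit. On a balanced set of members and non-members, the attack is right on every correctly classified member and every misclassified non-member, so its expected accuracy is 1/2 + (train accuracy − test accuracy)/2. The short sketch below, with hypothetical function names, implements the attack and this formula.

```python
import numpy as np

def gap_attack(predicted_labels, true_labels):
    """Guess 'member' exactly when the model classifies the point correctly."""
    return np.asarray(predicted_labels) == np.asarray(true_labels)

def gap_attack_accuracy(train_acc, test_acc):
    """Expected attack accuracy on a balanced set of members and non-members."""
    return 0.5 + (train_acc - test_acc) / 2.0
```

For instance, a model with 100% training accuracy and 50% test accuracy yields an expected gap attack accuracy of 75%, whereas a model with no train-test gap reduces the attack to chance.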

We have further confirmed that attacks from prior work can, in some settings, extract more membership leakage than this baseline attack, but that the same can be achieved by label-only attacks that operate in a more restrictive adversarial model.

Finally, via a rigorous evaluation across many proposed defenses to membership inference, we have shown that differential privacy provides the strongest defense against membership inference, both in an average-case and worst-case sense, but that this may come at a cost in the model’s prediction accuracy.


  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 308–318. External Links: ISBN 9781450341394, Link, Document Cited by: §-A, §I, §VIII-B, §VIII-B.
  • [2] Y. Bengio, I. Goodfellow, and A. Courville (2017) Deep learning. Vol. 1, MIT press. Cited by: §II-A.
  • [3] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, New York, NY, USA, pp. 1175–1191. External Links: ISBN 9781450349468, Link, Document Cited by: §I.
  • [4] W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §IV-B, §IV-D.
  • [5] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284. Cited by: §I, §IX.
  • [6] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §IV-D, Fig. 2, §VI-B.
  • [7] K. Chaudhuri and C. Monteleoni (2009) Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), pp. 289–296. External Links: Link Cited by: §I.
  • [8] J. Chen, M. I. Jordan, and M. J. Wainwright (2019) HopSkipJumpAttack: a query-efficient decision-based attack. External Links: 1904.02144 Cited by: §IV-B, §IV-D, Fig. 2, §VI-A, §VI-B.
  • [9] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) AutoAugment: learning augmentation policies from data. External Links: 1805.09501 Cited by: §II-A1.
  • [10] X. Cui, V. Goel, and B. Kingsbury (2015) Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (9), pp. 1469–1477. Cited by: §II-A1.
  • [11] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. External Links: 1901.02860 Cited by: §I.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. External Links: 1810.04805 Cited by: §I.
  • [13] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §II-A1, item 3.
  • [14] M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). External Links: Link, Document Cited by: §II-A1.
  • [15] M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In ACL, Cited by: §VIII-A.
  • [16] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2018) Data augmentation using synthetic data for time series classification with deep residual networks. External Links: 1808.02455 Cited by: §VIII-A.
  • [17] N. Ford, J. Gilmer, N. Carlini, and D. Cubuk (2019) Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513. Cited by: §IV-D.
  • [18] Y. Gal (2016) Uncertainty in deep learning. University of Cambridge 1, pp. 3. Cited by: footnote 2.
  • [19] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. External Links: 1412.6572 Cited by: §IV-B.
  • [20] J. Hayes, L. Melis, G. Danezis, and E. D. Cristofaro (2019) LOGAN: membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies 2019 (1), pp. 133 – 152. External Links: Link Cited by: §I, §III-B.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §II-A1, §V-B.
  • [22] B. Jayaraman, L. Wang, D. Evans, and Q. Gu (2020) Revisiting membership inference under realistic assumptions. arXiv preprint arXiv:2005.10881. Cited by: footnote 4.
  • [23] J. Jia, A. Salem, M. Backes, Y. Zhang, and N. Z. Gong (2019) MemGuard: defending against black-box membership inference attacks via adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 259–274. Cited by: item 2, §I, §I, §II-B, §II-B, Fig. 3, §VII-A, §VII-A, TABLE II, §VII, §VII, §VII.
  • [24] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis (2015) Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal 13, pp. 8–17. Cited by: §I.
  • [25] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §V-B.
  • [26] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document Cited by: §V-B.
  • [27] K. Leino and M. Fredrikson (2019) Stolen memories: leveraging model memorization for calibrated white-box membership inference. arXiv preprint arXiv:1906.11798. Cited by: §II-B, §III-A.
  • [28] Y. Long, V. Bindschaedler, and C. A. Gunter (2017) Towards measuring membership privacy. arXiv preprint arXiv:1712.09136. Cited by: §II-B, TABLE I.
  • [29] Y. Long, V. Bindschaedler, L. Wang, D. Bu, X. Wang, H. Tang, C. A. Gunter, and K. Chen (2018) Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889. Cited by: §II-B, §II-B, §II-B, §IX.
  • [30] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §V-B.
  • [31] A. Mikołajczyk and M. Grochowski (2018-05) Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp. 117–122. External Links: Document Cited by: §VIII-A.
  • [32] K. P. Murphy (2012) Machine learning: a probabilistic perspective. The MIT Press. External Links: ISBN 0262018020, 9780262018029 Cited by: §II-A.
  • [33] M. Nasr, R. Shokri, and A. Houmansadr (2018) Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646. Cited by: item 2, §I, §I, §II-B, §II-B, Fig. 4, §VII-B, §VII, §VII, §VII.
  • [34] M. Naveed, S. Kamara, and C. V. Wright (2015) Inference attacks on property-preserving encrypted databases. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 644–655. Cited by: §I.
  • [35] E. W. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun (2011) The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decision support systems 50 (3), pp. 559–569. Cited by: §I.
  • [36] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, New York, NY, USA, pp. 506–519. External Links: ISBN 978-1-4503-4944-4, Link, Document Cited by: footnote 1.
  • [37] L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. External Links: 1712.04621 Cited by: §II-A1.
  • [38] A. Pyrgelis, C. Troncoso, and E. De Cristofaro (2017) Knock knock, who’s there? membership inference on aggregate location data. arXiv preprint arXiv:1708.06145. Cited by: §I.
  • [39] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §IV-B.
  • [40] M. Sajjad, S. Khan, K. Muhammad, W. Wu, A. Ullah, and S. W. Baik (2019) Multi-grade brain tumor classification using deep cnn with extensive data augmentation. Journal of computational science 30, pp. 174–182. Cited by: §II-A1.
  • [41] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes (2018) ML-leaks: model and data independent membership inference attacks and defenses on machine learning models. External Links: 1806.01246 Cited by: item 1, §I, §I, §II-B, §III-A, §III-B, §III-B, §III-C, TABLE I, §VII.
  • [42] M. Schuckert, X. Liu, and R. Law (2015) Hospitality and tourism online reviews: recent trends and future directions. Journal of Travel & Tourism Marketing 32 (5), pp. 608–621. Cited by: §I.
  • [43] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §II-A.
  • [44] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2016) Membership inference attacks against machine learning models. External Links: 1610.05820 Cited by: item 1, §I, §I, §I, §I, §II-B, §II-B, §II-B, §II-B, §III-A, §III-B, §III-B, TABLE I, §V-B, §VI-A, §VII-A, §VII.
  • [45] C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60. Cited by: §VIII-A.
  • [46] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. External Links: 2001.07685 Cited by: §II-A1, §V-B, Fig. 6, §VIII-A, §VIII-A.
  • [47] L. Song, R. Shokri, and P. Mittal (2019) Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 241–257. Cited by: footnote 3.
  • [48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §-A, §VIII-B.
  • [49] M. H. Stanfill, M. Williams, S. H. Fenton, R. A. Jenders, and W. R. Hersh (2010) A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association 17 (6), pp. 646–651. Cited by: §I.
  • [50] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §IV-B.
  • [51] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270–279. Cited by: §II-A2.
  • [52] L. Taylor and G. Nitschke (2018-11) Improving deep learning with generic data augmentation. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1542–1547. External Links: Document Cited by: §II-A1, §VIII-A.
  • [53] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §III-A.
  • [54] S. Truex, L. Liu, M. E. Gursoy, L. Yu, and W. Wei (2018) Towards demystifying membership inference attacks. arXiv preprint arXiv:1807.09173. Cited by: item 1, §I, §I, §II-B, §II-B, §III-B, §III-B, §VII.
  • [55] B. Wang and N. Z. Gong (2018) Stealing hyperparameters in machine learning. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 36–52. Cited by: §III-A.
  • [56] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, New York, NY, USA, pp. 1307–1322. External Links: ISBN 9781450341974, Link, Document Cited by: §I.
  • [57] Z. Yang, B. Shao, B. Xuan, E. Chang, and F. Zhang (2020) Defending model inversion and membership inference attacks via prediction purification. External Links: 2005.03915 Cited by: item 2, §I, §II-B, §VII-A, §VII, §VII.
  • [58] S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha (2018) Privacy risk in machine learning: analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282. Cited by: §I, §II-B, §II-B, §III-C, TABLE I, §IV-A, §V-B, §VI-A.