1. Introduction
Machine learning models, especially deep neural networks, have been deployed prominently in many real-world applications, such as image classification (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015), speech recognition (Hinton et al., 2012; Deng et al., 2013), natural language processing (Collobert et al., 2011; Andor et al., 2016), and game playing (Silver et al., 2016; Moravčík et al., 2017). However, since machine learning algorithms were originally designed without considering potential adversarial threats, their security and privacy vulnerabilities have come to the forefront in recent years, together with the arms race between attacks and defenses (Huang et al., 2011; Biggio and Roli, 2018; Papernot et al., 2018).
In the security domain, the adversary aims to induce misclassifications in the target machine learning model, with attack methods divided into two categories: evasion attacks and poisoning attacks (Huang et al., 2011). Evasion attacks, also known as adversarial examples, perturb inputs at test time to induce wrong predictions by the target model (Biggio et al., 2013; Szegedy et al., 2014; Carlini and Wagner, 2017; Goodfellow et al., 2015; Papernot et al., 2016). In contrast, poisoning attacks target the training process by maliciously modifying part of the training data to cause the trained model to misbehave on some test inputs (Biggio et al., 2012; Koh and Liang, 2017; Shafahi et al., 2018). In response to these attacks, the security community has designed new training algorithms to secure machine learning models against evasion attacks (Madry et al., 2018; Sinha et al., 2018; Zhang et al., 2019; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018) or poisoning attacks (Steinhardt et al., 2017; Jagielski et al., 2018).
In the privacy domain, the adversary aims to obtain private information about the training data or the target model. Attacks targeting data privacy include: the adversary inferring whether input examples were used to train the target model with membership inference attacks (Shokri et al., 2017; Yeom et al., 2018; Salem et al., 2019), learning global properties of training data with property inference attacks (Ganju et al., 2018), reconstructing training data with model inversion attacks (Fredrikson et al., 2014; Fredrikson et al., 2015) or colluding model training attacks (Song et al., 2017; Yeom et al., 2018). Attacks targeting model privacy include: the adversary uncovering the model details with model extraction attacks (Tramèr et al., 2016), and inferring hyperparameters with hyperparameter stealing attacks (Wang and Gong, 2018). In response to these attacks, the privacy community has designed defenses to prevent privacy leakage of training data (Nasr et al., 2018; Hayes and Ohrimenko, 2018; Shokri and Shmatikov, 2015; Abadi et al., 2016) or the target model (Kesarwani et al., 2018; Lee et al., 2019).
However, one important limitation of current machine learning defenses is that they typically focus solely on either the security domain or the privacy domain. It is thus unclear whether defense methods in one domain will have any unexpected impact on the other domain. In this paper, we take a step towards enhancing our understanding of machine learning models when the security and privacy domains are considered together. In particular, we seek to understand the privacy risks of securing machine learning models by evaluating membership inference attacks against adversarially robust deep learning models, which aim to mitigate the threat of adversarial examples.
Figure 1: Histogram of CIFAR10 classifiers’ loss values of training data (members) and test data (nonmembers). We can see the larger divergence between the loss distribution over members and nonmembers on the robust model as compared to the natural model. This shows the privacy risk of securing deep learning models against adversarial examples.
The membership inference attack aims to infer whether a data point is part of the target model’s training set or not, posing a serious privacy risk as the membership can reveal an individual’s sensitive information. For example, participation in a hospital’s health analytic training set means that an individual was once a patient in that hospital. It has been shown that the success of membership inference attacks is highly related to the target model’s overfitting and sensitivity to training data (Shokri et al., 2017; Yeom et al., 2018; Salem et al., 2019). Adversarially robust models aim to enhance the robustness of target models by ensuring that model predictions are unchanged for a small area (such as an $l_\infty$ ball) around each training example. Intuitively, adversarially robust models have the potential to overfit on the training set and increase the model sensitivity, resulting in an enhanced risk of membership inference attacks. As an example, Figure 1 shows the histogram of cross-entropy loss values of training data and test data for both naturally undefended and adversarially robust CIFAR10 classifiers provided by Madry et al. (Madry et al., 2018). We can see that members (training data) and nonmembers (test data) can be distinguished more easily for the robust model, compared to the natural model.
To measure the membership inference risks of adversarially robust models, besides the conventional inference method based on prediction confidence, we propose two new inference methods that exploit the structural properties of adversarial defenses. We measure the privacy risks of robust models trained with six state-of-the-art adversarial defense methods, and find that adversarially robust models are indeed more susceptible to membership inference attacks than naturally undefended models. We further perform a comprehensive investigation to analyze the relation between privacy leakage and model properties, and study the effect of countermeasures such as temperature scaling and regularization.
In summary, we make the following contributions in this paper:

We propose two new membership inference attacks specific to adversarially robust models by exploiting adversarial examples’ predictions and verified worst-case predictions. With these two new methods, we can achieve higher inference accuracies than the conventional inference method based on prediction confidence of benign inputs.

We perform membership inference attacks on models trained with six state-of-the-art adversarial defense methods (3 empirical defenses (Madry et al., 2018; Sinha et al., 2018; Zhang et al., 2019) and 3 verifiable defenses (Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018)). We demonstrate that all methods indeed increase the model’s membership inference risk. By defining the membership inference advantage as the increase in inference accuracy over random guessing (multiplied by 2) (Yeom et al., 2018), we show that robust machine learning models can incur a membership inference advantage , , times the membership inference advantage of naturally undefended models, on Yale Face, Fashion-MNIST, and CIFAR10 datasets, respectively.

We further explore the factors that influence the membership inference performance of the adversarially robust model, including its robustness generalization, the adversarial perturbation constraint, and the model capacity.

We study the effect of potential countermeasures, including temperature scaling and regularization, to reduce privacy leakage via membership inference attacks.
Some of our analysis was briefly discussed in a short workshop paper (Song et al., 2019). In this paper, we go further by proposing two new membership inference attacks and measuring four more adversarial defense methods, where we show that all adversarial defenses can increase the privacy risks of target models. We also perform a comprehensive investigation of factors that impact the privacy risks and discuss potential countermeasures.
2. Background and Related Work: Adversarial Examples and Membership Inference Attacks
In this section, we first present the background and related work on adversarial examples and defenses, and then discuss membership inference attacks.
2.1. Adversarial Examples and Defenses
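Let $F_\theta: \mathbb{R}^d \rightarrow \mathbb{R}^k$ be a machine learning model with $d$ input features and $k$ output classes, parameterized by weights $\theta$. For an example $z = (x, y)$ with the input feature $x$ and the ground truth label $y$, the model outputs a prediction vector over all class labels $F_\theta(x)$ with $\sum_{i=0}^{k-1} F_\theta(x)_i = 1$, and the final prediction will be the label with the largest prediction probability $\hat{y} = \arg\max_i F_\theta(x)_i$. For neural networks, the outputs of the penultimate layer are known as logits, and we represent them as a vector $g_\theta(x)$. The softmax function is then computed on the logits to obtain the final prediction vector.
$$F_\theta(x)_i = \frac{\exp(g_\theta(x)_i)}{\sum_{j=0}^{k-1} \exp(g_\theta(x)_j)} \tag{1}$$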
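The mapping from logits to a prediction vector and a final label can be sketched in a few lines of Python. This is only an illustration under our own naming: the helpers `softmax` and `predict` are hypothetical and not taken from the paper's code.

```python
import math

def softmax(logits):
    """Map a list of logits to a probability vector that sums to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (predicted label, prediction vector) for one input."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs
```

For instance, `predict([1.0, 2.0, 3.0])` selects label 2, the class with the largest logit.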
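Given a training set $D_{train}$, the natural training algorithm aims to make model predictions match ground truth labels by minimizing the prediction loss over all training examples.
$$\min_\theta \frac{1}{|D_{train}|} \sum_{z \in D_{train}} \ell(F_\theta, z) \tag{2}$$
where $|D_{train}|$ denotes the size of the training set, and $\ell$ computes the prediction loss. A widely adopted loss function is the cross-entropy loss:
$$\ell(F_\theta, z) = -\sum_{i=0}^{k-1} \mathbb{1}\{i = y\} \cdot \log(F_\theta(x)_i) \tag{3}$$
where $\mathbb{1}\{\cdot\}$ is the indicator function.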
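As a small illustration of the natural training objective, the cross-entropy loss of one example and its average over a set of examples can be computed as follows. The function names and the clamping constant are our own assumptions, chosen only to keep the sketch self-contained.

```python
import math

def cross_entropy(probs, y):
    """Cross-entropy loss of one example: -log of the probability given to label y."""
    return -math.log(max(probs[y], 1e-12))  # clamp to avoid log(0)

def average_loss(pred_vectors, labels):
    """Mean prediction loss over a set of examples (the quantity natural training minimizes)."""
    losses = [cross_entropy(p, y) for p, y in zip(pred_vectors, labels)]
    return sum(losses) / len(losses)
```

A confidently correct prediction contributes a loss near zero, while a uniform prediction over two classes contributes $\log 2$.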
2.1.1. Adversarial examples:
Although machine learning models have achieved tremendous success in many classification scenarios, they have been found to be easily fooled by adversarial examples (Szegedy et al., 2014; Biggio et al., 2013; Goodfellow et al., 2015; Carlini and Wagner, 2017; Papernot et al., 2016). Adversarial examples induce incorrect classifications in target models, and can be generated via imperceptible perturbations to benign inputs.
$$\arg\max_i F_\theta(\tilde{x})_i \neq y, \quad \text{such that } \tilde{x} \in B_\epsilon(x) \tag{4}$$
where $B_\epsilon(x)$ denotes the set of points around $x$ within the perturbation budget of $\epsilon$. Usually an $l_p$ ball is chosen as the perturbation constraint for generating adversarial examples, i.e., $B_\epsilon(x) = \{x' \mid \|x' - x\|_p \leq \epsilon\}$. We consider the $l_\infty$ ball adversarial constraint throughout the paper, as it is widely adopted by most adversarial defense methods (Madry et al., 2018; Sinha et al., 2018; Zhang et al., 2019; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018; Raghunathan et al., 2018).
The solution to Equation (4) is called an “untargeted adversarial example” as the adversarial goal is to achieve any incorrect classification. In comparison, a “targeted adversarial example” ensures that the model prediction is a specified incorrect label $y'$, which is not equal to $y$.
$$\arg\max_i F_\theta(\tilde{x})_i = y', \quad \text{such that } \tilde{x} \in B_\epsilon(x) \tag{5}$$
Unless otherwise specified, an adversarial example in this paper refers to an untargeted adversarial example.
To provide adversarial robustness under the perturbation constraint $B_\epsilon$, instead of the natural training algorithm shown in Equation (2), a robust training algorithm is adopted by adding an additional robust loss function.
$$\min_\theta \frac{1}{|D_{train}|} \sum_{z \in D_{train}} \alpha \cdot \ell(F_\theta, z) + (1 - \alpha) \cdot \ell_R(F_\theta, z, B_\epsilon) \tag{6}$$
where $\alpha$ is the ratio to trade off natural loss and robust loss, and $\ell_R$ measures the robust loss, which can be formulated as maximizing prediction loss $\ell'$ under the constraint $B_\epsilon$.
$$\ell_R(F_\theta, z, B_\epsilon) = \max_{\tilde{x} \in B_\epsilon(x)} \ell'(F_\theta, (\tilde{x}, y)) \tag{7}$$
$\ell'$ can be the same as $\ell$ or another appropriate loss function.
However, it is usually hard to find the exact solution to Equation (7). Therefore, adversarial defenses propose different ways to approximate the robust loss $\ell_R$, which can be divided into two categories: empirical defenses and verifiable defenses.
2.1.2. Empirical defenses:
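Empirical defense methods approximate robust loss values by generating adversarial examples $\tilde{x}_{adv}$ at each training step with state-of-the-art attack methods and computing their prediction loss. The robust training algorithm can then be expressed as follows.
$$\min_\theta \frac{1}{|D_{train}|} \sum_{z \in D_{train}} \alpha \cdot \ell(F_\theta, z) + (1 - \alpha) \cdot \ell'(F_\theta, (\tilde{x}_{adv}, y)) \tag{8}$$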
Three of our tested adversarial defense methods belong to this category, which are described as follows.
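PGD-Based Adversarial Training (PGD-Based AdvTrain) (Madry et al., 2018): Madry et al. (Madry et al., 2018) propose one of the most effective empirical defense methods by using the projected gradient descent (PGD) method to generate adversarial examples for maximizing cross-entropy loss ($\ell' = \ell$) and training purely on those adversarial examples ($\alpha = 0$). The PGD attack contains $T$ gradient descent steps, which can be expressed as
$$\tilde{x}^{t+1} = \Pi_{B_\epsilon(x)}\left[\tilde{x}^{t} + \eta \cdot \mathrm{sign}\left(\nabla_{\tilde{x}^{t}}\, \ell(F_\theta, (\tilde{x}^{t}, y))\right)\right] \tag{9}$$
where $\tilde{x}^{0} = x$, $\tilde{x}_{adv} = \tilde{x}^{T}$, $\eta$ is the step size value, $\nabla$ denotes the gradient computation, and $\Pi_{B_\epsilon(x)}$ means the projection onto the perturbation constraint $B_\epsilon(x)$.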
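To make the iterative attack concrete, the following sketch runs signed-gradient steps with projection on a toy linear softmax classifier, for which the input gradient of the cross-entropy loss has the closed form $W^\top(p - \mathrm{onehot}(y))$. The linear model, the $l_\infty$ ball, and all function and parameter names (`pgd_linf`, `step`, `n_steps`) are assumptions made for illustration; the defenses discussed here apply the same loop to deep networks using automatic differentiation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_fn(W, b, x):
    """Logits of a linear model: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def loss_grad_x(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x: W^T (p - onehot(y))."""
    p = softmax(logits_fn(W, b, x))
    p[y] -= 1.0
    return [sum(W[j][i] * p[j] for j in range(len(W))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(W, b, x, y, eps, step, n_steps):
    """Untargeted PGD: ascend the loss, then project onto the l_inf ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        grad = loss_grad_x(W, b, x_adv, y)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # projection: clip every coordinate back into [x_i - eps, x_i + eps]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv
```

For an $l_\infty$ ball the projection reduces to coordinate-wise clipping, which is why the loop needs no separate optimization step.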
Distributional Adversarial Training (Dist-Based AdvTrain) (Sinha et al., 2018): Instead of strictly satisfying the perturbation constraint with a projection step as in PGD attacks, Sinha et al. (Sinha et al., 2018) generate adversarial examples by solving the Lagrangian relaxation of cross-entropy loss:
$$\max_{\tilde{x}}\ \ell(F_\theta, (\tilde{x}, y)) - \gamma\, \|\tilde{x} - x\|_p \tag{10}$$
where $\gamma$ is the penalty parameter for the $l_p$ distance. A multi-step gradient descent method is adopted to solve Equation (10). The model will then be trained on the cross-entropy loss ($\ell' = \ell$) of adversarial examples only ($\alpha = 0$).
Sinha et al. (Sinha et al., 2018) derive a statistical guarantee for distributional robustness with strict conditions requiring the loss function to be smooth on $x$, which are not satisfied in our setting. We mainly use widely adopted ReLU activation functions for our machine learning models, which result in a non-smooth loss function. Also, we generate adversarial examples with $l_\infty$ distance penalties by using the algorithm proposed by Sinha et al. (Sinha et al., 2018) in Appendix E, where there is no robustness guarantee. Thus, we categorize the defense method as empirical.
Difference-based Adversarial Training (Diff-Based AdvTrain) (Zhang et al., 2019): Instead of using the cross-entropy loss of adversarial examples, with insights from a toy binary classification task, Zhang et al. (Zhang et al., 2019) propose to use the difference (e.g., Kullback-Leibler (KL) divergence) between the benign output and the adversarial output as the loss function $\ell'$, and combine it with the natural cross-entropy loss ($\alpha \neq 0$).
$$\ell_R(F_\theta, z, B_\epsilon) = d_{kl}(F_\theta(\tilde{x}_{adv}), F_\theta(x)) \tag{11}$$
where $d_{kl}$ computes the KL divergence. Adversarial examples are also generated with PGD-based attacks, except that now the attack goal is to maximize the output difference,
$$\tilde{x}_{adv} = \arg\max_{\tilde{x} \in B_\epsilon(x)} d_{kl}(F_\theta(\tilde{x}), F_\theta(x)) \tag{12}$$
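A difference-based training loss of this kind can be sketched as the natural cross-entropy plus a weighted KL term between the adversarial and benign prediction vectors. The argument order of the KL divergence, the `weight` parameter, and the function names are assumptions made for illustration and may differ from the implementation of Zhang et al.

```python
import math

def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / max(qi, tiny)) for pi, qi in zip(p, q) if pi > 0.0)

def diff_based_loss(benign_probs, adv_probs, y, weight=1.0):
    """Natural cross-entropy on the benign input plus a KL penalty on the output change."""
    natural = -math.log(max(benign_probs[y], 1e-12))
    robust = kl_divergence(adv_probs, benign_probs)  # assumed argument order
    return natural + weight * robust
```

When the adversarial perturbation leaves the prediction vector unchanged, the KL term vanishes and the loss reduces to the natural cross-entropy.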
2.1.3. Verifiable defenses:
Although empirical defense methods are effective against state-of-the-art adversarial examples (Athalye et al., 2018), there is no guarantee for such robustness. To obtain a guarantee for robustness, verification approaches have been proposed to compute an upper bound of prediction loss under the adversarial perturbation constraint $B_\epsilon$. If the input can still be predicted correctly in the verified worst case, then it is certain that no misclassification exists under $B_\epsilon$.
Thus, verifiable defense methods take the verification process into consideration during training by using the verified worst-case prediction loss as the robust loss value $\ell_R$. Now the robust training algorithm becomes
$$\min_\theta \frac{1}{|D_{train}|} \sum_{z \in D_{train}} \alpha \cdot \ell(F_\theta, z) + (1 - \alpha) \cdot V(\ell'(F_\theta, (\tilde{x}, y)), B_\epsilon) \tag{13}$$
where $V$ means verified upper bound computation of prediction loss $\ell'$ under the adversarial perturbation constraint $B_\epsilon$. In this paper, we consider the following three verifiable defense methods.
Duality-Based Verification (Dual-Based Verify) (Wong and Kolter, 2018): Wong and Kolter (Wong and Kolter, 2018) compute the verified worst-case loss by solving its dual problem with convex relaxation on non-convex ReLU operations and then minimize only this over-approximated robust loss ($\alpha = 0$). They further combine this duality relaxation method with the random projection technique to scale to more complex neural network architectures (Wong et al., 2018), like ResNet (He et al., 2016).
Abstract Interpretation-Based Verification (Abs-Based Verify) (Mirman et al., 2018): Mirman et al. (Mirman et al., 2018) leverage the technique of abstract interpretation to compute the worst-case loss: an abstract domain (such as the interval domain or the zonotope domain (Gehr et al., 2018)) is used to express the adversarial perturbation constraint $B_\epsilon$ at the input layer, and by applying abstract transformers on it, the maximum verified range of model output is obtained. They adopt a softplus function on the logits to compute the robust loss value and then combine it with the natural training loss ($\alpha \neq 0$).
$$\ell'(F_\theta, (\tilde{x}, y)) = \max_{y' \neq y} \log\left(\exp\left(g_\theta(\tilde{x})_{y'} - g_\theta(\tilde{x})_{y}\right) + 1\right) \tag{14}$$
Interval Bound Propagation-Based Verification (IBP-Based Verify) (Gowal et al., 2018): Gowal et al. (Gowal et al., 2018) share a similar design with Mirman et al. (Mirman et al., 2018): they express the constraint $B_\epsilon$ as a bounded interval domain (one specific domain considered by Mirman et al. (Mirman et al., 2018)) and propagate this bound to the output layer. The robust loss is computed as a cross-entropy loss of verified worst-case outputs ($\ell' = \ell$) and then combined with the natural prediction loss ($\alpha \neq 0$) as the final loss value during training.
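The idea of propagating an interval through the network can be illustrated on a small fully connected ReLU network. The sketch below bounds each logit independently, which is sound but looser than the margin-based bounds used in practice; the layer representation (a list of `(W, b)` pairs) and all function names are assumptions made for this example.

```python
def interval_linear(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b, coordinate by coordinate."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        # a positive weight takes the matching end of the input interval, a negative one the opposite end
        out_lo.append(bias + sum(w * (l if w >= 0 else h) for w, l, h in zip(row, lo, hi)))
        out_hi.append(bias + sum(w * (h if w >= 0 else l) for w, l, h in zip(row, lo, hi)))
    return out_lo, out_hi

def interval_relu(lo, hi):
    """ReLU is monotone, so it can be applied to both interval ends."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def worst_case_logits(layers, x, eps, y):
    """Lower bound for the true-class logit, upper bounds for all other logits."""
    lo = [v - eps for v in x]
    hi = [v + eps for v in x]
    for k, (W, b) in enumerate(layers):
        lo, hi = interval_linear(W, b, lo, hi)
        if k < len(layers) - 1:  # no ReLU after the output layer
            lo, hi = interval_relu(lo, hi)
    return [lo[j] if j == y else hi[j] for j in range(len(lo))]

def is_verified(layers, x, eps, y):
    """True if the true class wins even under the worst-case logits."""
    z = worst_case_logits(layers, x, eps, y)
    return all(z[y] > z[j] for j in range(len(z)) if j != y)
```

Applying softmax to the worst-case logits yields a lower bound on the true-class confidence over the whole perturbation set, which is the kind of verified quantity a verifiable defense trains on.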
2.2. Membership Inference Attacks
For a target machine learning model, the membership inference attacks aim to determine whether a given data point was used to train the model or not (Shokri et al., 2017; Yeom et al., 2018; Salem et al., 2019; Nasr et al., 2019; Long et al., 2017; Hayes et al., 2018). The attack poses a serious privacy risk to the individuals whose data is used for model training, for example in the setting of health analytics.
Shokri et al. (Shokri et al., 2017) design a membership inference attack method based on training an inference model to distinguish between predictions on training set members versus nonmembers. To train the inference model, they introduce the shadow training technique: (1) the adversary first trains multiple “shadow models” which simulate the behavior of the target model, (2) based on the shadow models’ outputs on their own training and test examples, the adversary obtains a labeled (member vs nonmember) dataset, and (3) finally trains the inference model as a neural network to perform membership inference attack against the target model. The input to the inference model is the prediction vector of the target model on a target data record.
A simpler inference model, such as a linear classifier, can also distinguish significantly vulnerable members from nonmembers. Yeom et al. (Yeom et al., 2018) suggest comparing the prediction confidence value of a target example with a threshold (learned, for example, through shadow training). Large confidence indicates membership. Their experimental results show that such a simple inference model is reasonably effective and achieves membership inference accuracy close to that of the shadow training method on target models. In this paper, we mainly follow this confidence-thresholding membership inference approach.
3. Membership Inference Attacks against Robust Models
In this section, we first present some insights on why training models to be robust against adversarial examples makes them more susceptible to membership inference attacks. We then formally present our membership inference attacks.
Throughout the paper, we use “natural (default) model” and “robust model” to denote the machine learning model with natural training algorithm and robust training algorithm, respectively. We also refer to the unmodified inputs and adversarially perturbed inputs as “benign examples” and “adversarial examples”. When evaluating the model’s classification performance, “train accuracy” and “test accuracy” are used to denote the classification accuracy of benign examples from training and test sets; “adversarial train accuracy” and “adversarial test accuracy” represent the classification accuracy of adversarial examples from training and test sets; “verified train accuracy” and “verified test accuracy” measure the classification accuracy under the verified worst-case predictions from training and test sets. Finally, an input example is called “secure” when it is correctly classified by the model for all adversarial perturbations within the constraint $B_\epsilon$, and “insecure” otherwise.
The performance of membership inference attacks is highly related to how much the target models overfit their training data (Shokri et al., 2017; Yeom et al., 2018). An extremely simple attack algorithm can infer membership based on whether or not an input is correctly classified. In this case, it is clear that a large gap between the target model’s train and test accuracy leads to a significant membership inference attack accuracy (as most members are correctly classified, but not the nonmembers). Tsipras et al. (Tsipras et al., 2019) and Zhang et al. (Zhang et al., 2019) show that robust training might lead to a drop in test accuracy (i.e., more overfitting). This is shown based on both empirical and theoretical analysis on toy classification tasks. Moreover, the overfitting phenomenon can be worse for a robust model when evaluating its accuracy on adversarial examples (Song et al., 2019; Schmidt et al., 2018). Thus, compared with the natural models, the robust models might leak more membership information, due to exhibiting a larger generalization error, in both the benign and adversarial settings.
The performance of membership inference attacks is also related to the target model’s sensitivity with regard to training data (Long et al., 2017). The sensitivity measure is the influence of one data point on the target model’s performance, computed as the difference in its prediction when the model is trained with and without this data point. Intuitively, when the target model is more sensitive to a training point, it is more likely that the membership inference adversary can guess its membership correctly. The robust training algorithms aim to ensure that model predictions remain unchanged for a small area (such as the $l_\infty$ ball) around any data point. However, in practice, they guarantee this only for the training examples, thus magnifying the influence of the training data on the model. Therefore, compared with natural training, the robust training algorithms might make the model more susceptible to membership inference attacks, by increasing its sensitivity to its training data.
To validate the above insights, let’s take the natural and the robust CIFAR10 classifiers provided by Madry et al. (Madry et al., 2018) as an example. From Figure 1, we have seen that compared to the natural model, the robust model has a larger divergence between the prediction loss of training data and test data. Our fine-grained analysis in Appendix A further reveals that the large divergence of the robust model is highly related to its robustness performance. Moreover, the robust model incurs a significant generalization error in the adversarial setting, with about 96% adversarial train accuracy, and only about 47% adversarial test accuracy (Table 4). Finally, we will experimentally show in Section 5.2.1 that the robust model is indeed more sensitive with regard to training data.
3.1. Membership Inference Performance
Table 1: Notation.
Symbol | Description
$F$ | Target machine learning model.
$B_\epsilon$ | Adversarial perturbation constraint when training a robust model.
$D_{train}$ | Model’s training set.
$D_{test}$ | Model’s test set.
$x$ | Benign (unmodified) input example.
$y$ | Ground truth label for the input $x$.
$x_{adv}$ | Adversarial example generated from $x$.
$V$ | Robustness verification to compute verified worst-case predictions.
$I$ | Membership inference strategy.
$A_{inf}$ | Membership inference accuracy.
$ADV_{inf}$ | Membership inference advantage compared to random guessing.
In this part, we describe the membership inference attack and its performance formally, with notations listed in Table 1. For a neural network model $F$ (we skip its parameter $\theta$ for simplicity) that is robustly trained with the adversarial constraint $B_\epsilon$, the membership inference attack aims to determine whether a given input example $z = (x, y)$ is in its training set $D_{train}$ or not. We denote the inference strategy adopted by the adversary as $I(F, B_\epsilon, z)$, which codes members as 1, and nonmembers as 0.
We use the fraction of correct membership predictions as the metric to evaluate membership inference accuracy. We use a test set $D_{test}$, which does not overlap with the training set, to represent nonmembers. We sample a random data point $(x, y)$ from either $D_{train}$ or $D_{test}$ with equal probability to test the membership inference attack. We measure the membership inference accuracy as follows.
$$A_{inf}(F, B_\epsilon, I) = \frac{\sum_{z \in D_{train}} I(F, B_\epsilon, z)}{2\,|D_{train}|} + \frac{\sum_{z \in D_{test}} \left(1 - I(F, B_\epsilon, z)\right)}{2\,|D_{test}|} \tag{15}$$
where $|\cdot|$ measures the size of a dataset.
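Because a data point is drawn from the training or test set with equal probability, the accuracy is the average of the fraction of members correctly flagged and the fraction of nonmembers correctly rejected. The following sketch computes it from lists of 0/1 decisions, together with the membership inference advantage defined earlier as the increase over random guessing multiplied by 2. The function names are ours, chosen for illustration.

```python
def inference_accuracy(train_decisions, test_decisions):
    """Balanced accuracy of 0/1 membership decisions (1 = inferred member)."""
    member_rate = sum(train_decisions) / len(train_decisions)
    nonmember_rate = sum(1 - d for d in test_decisions) / len(test_decisions)
    return 0.5 * member_rate + 0.5 * nonmember_rate

def inference_advantage(accuracy):
    """Improvement over random guessing (accuracy 0.5), multiplied by 2."""
    return 2.0 * (accuracy - 0.5)
```

For example, flagging 3 of 4 training points and 2 of 4 test points as members gives an accuracy of 0.625 and an advantage of 0.25.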
3.2. Exploiting the Model’s Predictions on Benign Examples
We adopt a confidence-thresholding inference strategy due to its simplicity and effectiveness (Yeom et al., 2018): an input $(x, y)$ is inferred as a member if its prediction confidence $F(x)_y$ is larger than a preset threshold value. We denote this inference strategy as $I_B$ since it relies on the benign examples’ predictions. We have the following expressions for this inference strategy and its inference accuracy.
$$I_B(F, B_\epsilon, (x, y)) = \mathbb{1}\{F(x)_y > \tau_B\}$$
$$A_{inf}(F, B_\epsilon, I_B) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{train}} \mathbb{1}\{F(x)_y > \tau_B\}}{|D_{train}|} - \frac{\sum_{z \in D_{test}} \mathbb{1}\{F(x)_y > \tau_B\}}{|D_{test}|}\right) \tag{17}$$
where $\mathbb{1}\{\cdot\}$ is the indicator function and the last two terms are the values of the complementary cumulative distribution functions of training examples’ and test examples’ prediction confidences, at the point of threshold $\tau_B$, respectively. In our experiments, we evaluate the worst-case inference risks by choosing $\tau_B$ to achieve the highest inference accuracy, i.e., maximizing the gap between the two complementary cumulative distribution function values. In practice, an adversary can learn the threshold via the shadow training technique (Shokri et al., 2017).
This inference strategy does not leverage the adversarial constraint $B_\epsilon$. In this paper, we propose two new membership inference strategies by taking $B_\epsilon$ into consideration, which are tailored to robust models.
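The worst-case threshold selection can be sketched as a sweep over all observed confidence values, keeping the one that maximizes the gap between the fractions of training and test confidences above the threshold. The function name `best_threshold` and the strict comparison are assumptions of this sketch, not details confirmed by the paper.

```python
def best_threshold(train_conf, test_conf):
    """Return (threshold, accuracy) maximizing the confidence-thresholding inference accuracy."""
    best_tau, best_acc = None, -1.0
    for tau in sorted(set(train_conf) | set(test_conf)):
        # fraction of confidences strictly above tau in each set
        above_train = sum(c > tau for c in train_conf) / len(train_conf)
        above_test = sum(c > tau for c in test_conf) / len(test_conf)
        acc = 0.5 + 0.5 * (above_train - above_test)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc
```

With training confidences `[0.99, 0.95, 0.9, 0.6]` and test confidences `[0.97, 0.7, 0.5, 0.4]`, the sweep selects the threshold 0.5 with an inference accuracy of 0.75. The same routine applies unchanged when the confidences are computed on adversarial examples or on verified worst-case predictions.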
3.3. Exploiting the Model’s Predictions on Adversarial Examples
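Our first new inference strategy is to generate an (untargeted) adversarial example $x_{adv}$ for input $(x, y)$ under the constraint $B_\epsilon$, and use a threshold on the model’s prediction confidence on $x_{adv}$. We have the following expressions for this strategy and its inference accuracy.
$$I_A(F, B_\epsilon, (x, y)) = \mathbb{1}\{F(x_{adv})_y > \tau_A\}$$
$$A_{inf}(F, B_\epsilon, I_A) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{train}} \mathbb{1}\{F(x_{adv})_y > \tau_A\}}{|D_{train}|} - \frac{\sum_{z \in D_{test}} \mathbb{1}\{F(x_{adv})_y > \tau_A\}}{|D_{test}|}\right) \tag{18}$$
In our experiments, we use the PGD attack method shown in Equation (9) with the same perturbation constraint $B_\epsilon$ as in the robust training process to obtain $x_{adv}$. We choose the preset threshold $\tau_A$ to achieve the highest inference accuracy, i.e., maximizing the gap between the two complementary cumulative distribution functions of prediction confidence on adversarial train and test examples.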
3.3.1. Targeted adversarial examples
We extend the attack to exploit targeted adversarial examples. Targeted adversarial examples contain information about the distance of the benign input to each label’s decision boundary, and are expected to leak more membership information than the untargeted adversarial example, which only contains information about the distance to a nearby label’s decision boundary.
We adapt the PGD attack method to find targeted adversarial examples (Equation (5)) by iteratively minimizing the targeted cross-entropy loss.
$$\tilde{x}^{t+1} = \Pi_{B_\epsilon(x)}\left[\tilde{x}^{t} - \eta \cdot \mathrm{sign}\left(\nabla_{\tilde{x}^{t}}\, \ell(F_\theta, (\tilde{x}^{t}, y'))\right)\right] \tag{19}$$
The confidence-thresholding inference strategy does not apply to targeted adversarial examples because there exist $k - 1$ targeted adversarial examples (we have $k - 1$ incorrect labels) for each input. Instead, following Shokri et al. (Shokri et al., 2017), we train a binary inference classifier for each class label to perform the membership inference attack. For each class label, we first choose a fraction of training and test points and generate corresponding targeted adversarial examples. Next, we compute model predictions on the targeted adversarial examples, and use them to train the membership inference classifier. Finally, we perform inference attacks using the remaining training and test points.
3.4. Exploiting the Verified WorstCase Predictions on Adversarial Examples
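Our attacks above generate adversarial examples using the heuristic strategy of projected gradient descent. Next, we leverage verification techniques $V$ used by the verifiably defended models (Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018) to obtain the input’s worst-case predictions under the adversarial constraint $B_\epsilon$. We use the input’s worst-case prediction confidence to predict its membership. The expressions for this strategy and its inference accuracy are as follows.
$$I_V(F, B_\epsilon, (x, y)) = \mathbb{1}\{V(F, (x, y), B_\epsilon) > \tau_V\}$$
$$A_{inf}(F, B_\epsilon, I_V) = \frac{1}{2} + \frac{1}{2} \cdot \left(\frac{\sum_{z \in D_{train}} \mathbb{1}\{V(F, (x, y), B_\epsilon) > \tau_V\}}{|D_{train}|} - \frac{\sum_{z \in D_{test}} \mathbb{1}\{V(F, (x, y), B_\epsilon) > \tau_V\}}{|D_{test}|}\right) \tag{20}$$
where $V(F, (x, y), B_\epsilon)$ returns the verified worst-case prediction confidence for all examples $\tilde{x}$ satisfying the adversarial perturbation constraint $\tilde{x} \in B_\epsilon(x)$, and $\tau_V$ is chosen in a similar manner as in our previous two inference strategies.
Note that different verifiable defenses adopt different verification methods $V$. Our inference strategy $I_V$ uses the same verification method that is used in the target model’s verifiably robust training process.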
4. Experiment Setup
In this section, we describe the datasets, neural network architectures, and corresponding adversarial perturbation constraints that we use in our experiments. Throughout the paper, we focus on the $l_\infty$ perturbation constraint: $B_\epsilon(x) = \{x' \mid \|x' - x\|_\infty \leq \epsilon\}$. The detailed architectures are summarized in Appendix B. We will make our code publicly available to facilitate reproducible experiments and verification.
Yale Face. The extended Yale Face database B is used to train face recognition models, and contains grayscale face images of 38 subjects under various lighting conditions (Georghiades et al., 2001; Lee et al., 2005). We use the cropped version of this dataset, where all face images are aligned and cropped to have the dimension of $168 \times 192$. In this version, each subject has 64 images with the same frontal poses under different lighting conditions, among which 18 images were corrupted during the image acquisition, leading to 2,414 images in total (Lee et al., 2005). In our experiments, we select 50 images for each subject to form the training set (total size is 1,900 images), and use the remaining 514 images as the test set.
For the model architecture, we use a convolutional neural network (CNN) with the convolution kernel size $3 \times 3$, as suggested by Simonyan et al. (Simonyan and Zisserman, 2015). The CNN model contains 4 blocks with different numbers of output channels, and each block contains two convolution layers. The first layer uses a stride of  for convolutions, and the second layer uses a stride of . There are two fully connected layers after the convolutional layers, each containing  and  neurons. When training the robust models, we set the perturbation budget ($\epsilon$) to be .
Fashion-MNIST. This dataset consists of a training set of 60,000 examples and a test set of 10,000 examples (Xiao et al., 2017). Each example is a $28 \times 28$ grayscale image, associated with a class label from 10 fashion products, such as shirt, coat, and sneaker.
Similar to Yale Face, we also adopt a CNN architecture with the convolution kernel size . The model contains 2 blocks with output channel numbers , and each block contains three convolution layers. The first two layers both use a stride of , while the last layer uses a stride of . Two fully connected layers are added at the end, with and neurons, respectively. When training the robust models, we set the perturbation budget () to be .
CIFAR10. This dataset is composed of $32 \times 32$ color images in 10 classes, with 6,000 images per class. In total, there are 50,000 training images and 10,000 test images.
We use the wide ResNet architecture (Zagoruyko and Komodakis, 2016) to train a CIFAR10 classifier, following Madry et al. (Madry et al., 2018). It contains 3 groups of residual layers with output channel numbers (160, 320, 640) and 5 residual units for each group. Similarly, two fully connected layers with  and  neurons are added at the end. When training the robust models, we set the perturbation budget ($\epsilon$) to be .
5. Membership Inference Attacks against Empirically Robust Models
In this section we discuss membership inference attacks against 3 empirical defense methods: PGD-based adversarial training (PGD-Based AdvTrain) (Madry et al., 2018), distributional adversarial training (Dist-Based AdvTrain) (Sinha et al., 2018), and difference-based adversarial training (Diff-Based AdvTrain) (Zhang et al., 2019). We train the robust models against the $l_\infty$ adversarial constraint on the Yale Face dataset, the Fashion-MNIST dataset, and the CIFAR10 dataset, with the neural network architectures described in Section 4. Following previous work (Athalye et al., 2018; Madry et al., 2018; Wong and Kolter, 2018; Mirman et al., 2018), the perturbation budget $\epsilon$ values are set to be , , and  on three datasets, respectively. For the empirically robust model, as explained in Section 2.1, there is no verification process to obtain a robustness guarantee. Thus the membership inference strategy $I_V$ does not apply here.
We first present an overall analysis that compares membership inference accuracy for natural models and robust models using multiple inference strategies across multiple datasets. We then present a deeper analysis of membership inference attacks against the PGDbased adversarial training defense.
5.1. Overall Results
Table 2. Membership inference attacks against natural and empirically robust models on the Yale Face dataset.
Training method  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
Natural  100%  98.25%  4.95%  4.09%  55.85%  54.53%
PGDBased AdvTrain (Madry et al., 2018)  99.89%  96.69%  99.05%  77.63%  61.69%  68.86%
DistBased AdvTrain (Sinha et al., 2018)  99.58%  93.77%  84.16%  56.23%  62.23%  64.25%
DiffBased AdvTrain (Zhang et al., 2019)  99.53%  93.77%  99.42%  83.85%  58.06%  65.73%
Table 3. Membership inference attacks against natural and empirically robust models on the FashionMNIST dataset.
Training method  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
Natural  100%  92.18%  4.29%  4.11%  57.12%  50.99%
PGDBased AdvTrain (Madry et al., 2018)  99.93%  90.88%  97.53%  68.74%  58.32%  64.42%
DistBased AdvTrain (Sinha et al., 2018)  97.98%  90.62%  70.22%  53.13%  57.35%  59.91%
DiffBased AdvTrain (Zhang et al., 2019)  99.35%  90.92%  90.57%  72.71%  57.02%  58.91%
Table 4. Membership inference attacks against natural and empirically robust models on the CIFAR10 dataset.
Training method  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
Natural  100%  95.01%  0%  0%  57.43%  50.85%
PGDBased AdvTrain (Madry et al., 2018)  99.99%  87.25%  96.07%  46.59%  74.89%  75.65%
DistBased AdvTrain (Sinha et al., 2018)  100%  90.10%  43.61%  27.34%  67.16%  64.69%
DiffBased AdvTrain (Zhang et al., 2019)  99.50%  87.99%  77.01%  47.28%  61.18%  67.08%
The membership inference attack results against natural models and empirically robust models (Madry et al., 2018; Sinha et al., 2018; Zhang et al., 2019) are presented in Table 2, Table 3 and Table 4, where “acc” stands for accuracy, while “advtrain acc” and “advtest acc” report adversarial accuracy under PGD attacks as shown in Equation (9).
According to these results, all three empirical defense methods will make the model more susceptible to membership inference attacks: compared with natural models, robust models increase the membership inference advantage by up to , , and , for Yale Face, FashionMNIST, and CIFAR10, respectively.
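As a worked illustration of how such factors can be derived from the reported accuracies, the sketch below converts inference accuracy into a membership inference advantage and compares robust and natural models. It assumes that the advantage of Equation (16) reduces to 2 × (accuracy − 0.5) on a balanced set of members and non-members; that form is an assumption here, since the equation is not restated in this section. The helper name `advantage` is hypothetical, and the accuracy values are the best entries per model type in the results for the three datasets.

```python
def advantage(inference_acc_percent):
    """Membership inference advantage on a balanced evaluation set.

    Assumption: advantage = 2 * (accuracy - 0.5), so random guessing
    gives 0 and perfect inference gives 1.
    """
    return 2.0 * (inference_acc_percent / 100.0 - 0.5)


# Best inference accuracy (%) over both strategies, copied from the results
# for natural models and for the most vulnerable robust model per dataset.
best_acc = {
    "Yale Face": {"natural": 55.85, "robust": 68.86},
    "FashionMNIST": {"natural": 57.12, "robust": 64.42},
    "CIFAR10": {"natural": 57.43, "robust": 75.65},
}

# Ratio of robust-model advantage to natural-model advantage per dataset.
advantage_ratio = {
    name: advantage(acc["robust"]) / advantage(acc["natural"])
    for name, acc in best_acc.items()
}

for name, ratio in advantage_ratio.items():
    print(f"{name}: advantage grows by a factor of {ratio:.2f}")
```

Under this assumed definition, an accuracy that looks only moderately higher can still correspond to a several-fold increase in advantage, because the baseline of random guessing (50%) is subtracted first.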
We also find that for robust models, membership inference attacks based on the prediction confidence of adversarial examples achieve higher inference accuracy than attacks based on the prediction confidence of benign examples in most cases. For natural models, on the other hand, attacks based on benign examples’ prediction confidence yield higher inference accuracy. This happens because our inference strategies rely on the difference between the confidence distribution of training points and that of test points. For robust models, most training points are (empirically) secure against adversarial examples, so adversarial perturbations do not significantly decrease the confidence on them. The test set, however, contains more insecure points, so adversarial perturbations enlarge the gap between the confidence distributions of training and test examples, leading to a higher inference accuracy. For natural models, adversarial examples shrink the confidence distribution gap, since almost all training and test points are insecure under adversarial perturbations. The only exception is the DistBased AdvTrain CIFAR10 classifier, where the strategy based on benign examples has the higher inference accuracy. This can be explained by the poor robustness of the model: its adversarial train accuracy is only 43.61%, so more than half of the training examples are insecure. Adversarial perturbations therefore decrease the confidence distribution gap between training and test examples in this specific scenario.
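The thresholding idea behind these strategies can be sketched as follows. This is a minimal illustration, assuming the attacker labels an input as a member when its prediction confidence (on either the benign input or its adversarial example) reaches a threshold, and that the threshold is chosen to maximize balanced accuracy. The function name and the toy confidence values are hypothetical and are not measurements from our experiments.

```python
def threshold_inference_accuracy(member_conf, nonmember_conf):
    """Best balanced accuracy of the rule 'member if confidence >= threshold'.

    Every observed confidence value is tried as a candidate threshold.
    """
    candidates = sorted(set(member_conf) | set(nonmember_conf))
    best_acc, best_thr = 0.0, None
    for thr in candidates:
        # Fraction of members correctly flagged as members.
        tpr = sum(c >= thr for c in member_conf) / len(member_conf)
        # Fraction of non-members correctly flagged as non-members.
        tnr = sum(c < thr for c in nonmember_conf) / len(nonmember_conf)
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr


# Toy confidence values (made up for illustration only).
train_conf = [0.99, 0.98, 0.97, 0.90, 0.60]
test_conf = [0.95, 0.80, 0.70, 0.55, 0.40]

acc, thr = threshold_inference_accuracy(train_conf, test_conf)
print(f"inference accuracy {acc:.2f} at threshold {thr:.2f}")
```

The wider the gap between the two confidence distributions, the higher the accuracy this rule can reach, which is why perturbations that lower test confidence more than training confidence help the attacker.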
5.2. Detailed Membership Inference Analysis of PGDBased Adversarial Training
In this part, we perform a detailed analysis of membership inference attacks against the PGD-based adversarial training defense method (Madry et al., 2018), using the CIFAR10 classifier as an example. We first perform a sensitivity analysis on both natural and robust models to show that the robust model is more sensitive to its training data than the natural model. We then investigate the relation between privacy leakage and model properties, including robustness generalization, the adversarial perturbation constraint, and model capacity. We finally show that the predictions of targeted adversarial examples can further enhance the membership inference advantage.
5.2.1. Sensitivity Analysis
In the sensitivity analysis, we remove sample CIFAR10 training points from the training set, retrain the models, and compute the performance difference between the original model and the retrained model.
We excluded 10 training points (one for each class label) and retrained the model. We computed the sensitivity of each excluded point as the difference between its prediction confidence in the retrained model and the original model. We obtained the sensitivity metric for 60 training points by retraining the classifier 6 times. Figure 2 depicts the sensitivity values for the 60 training points (in ascending order) for both robust and natural models. We can see that compared to the natural model, the robust model is indeed more sensitive to the training data, thus leaking more membership information.
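The sensitivity metric described above can be sketched in a few lines. The sign convention (original confidence minus retrained confidence, so that a positive value means the model relied on the excluded point) is an assumption, the function name is hypothetical, and the confidence values are made-up placeholders rather than measurements.

```python
def sensitivity_curve(conf_original, conf_retrained):
    """Per-point sensitivity, sorted in ascending order for plotting.

    conf_original[i]  : confidence on point i when it was in the training set
    conf_retrained[i] : confidence on point i after retraining without it
    """
    diffs = [orig - retr for orig, retr in zip(conf_original, conf_retrained)]
    return sorted(diffs)


# Placeholder confidences for three excluded points (illustration only).
conf_original = [0.99, 0.97, 0.98]
conf_retrained = [0.60, 0.85, 0.40]

curve = sensitivity_curve(conf_original, conf_retrained)
print([round(v, 2) for v in curve])
```

Computing one such curve for the robust model and one for the natural model, over the same excluded points, gives the two series that are compared in the sensitivity plot.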
5.2.2. Privacy risk with robustness generalization
We perform the following experiment to demonstrate the relation between privacy risk and robustness generalization. Recall that in the approach of Madry et al. (Madry et al., 2018), adversarial examples are generated from all training points during the robust training process. In our experiment, we modify the above defense approach to (1) leverage adversarial examples from a subset of the CIFAR10 training data to compute the robust prediction loss, and (2) leverage the remaining subset of training points as benign inputs to compute the natural prediction loss.
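The modified objective can be illustrated with per-example losses that are assumed to be already computed. In this minimal sketch, the first `adv_ratio` fraction of the training points contributes its robust (adversarial) loss and the rest contributes its natural loss; the function name, the ordering of the subset, and the toy loss values are assumptions made for illustration.

```python
def mixed_training_loss(natural_losses, robust_losses, adv_ratio):
    """Average loss when only a fraction of points use the robust loss.

    natural_losses[i] : loss on the benign input i
    robust_losses[i]  : loss on the adversarial example generated from input i
    adv_ratio         : fraction of points (taken from the front of the list)
                        whose adversarial examples are used
    """
    n = len(natural_losses)
    n_adv = round(adv_ratio * n)
    total = sum(robust_losses[:n_adv]) + sum(natural_losses[n_adv:])
    return total / n


# Toy per-example losses (made up for illustration only).
natural = [0.1, 0.2, 0.3, 0.4]
robust = [1.0, 1.2, 1.4, 1.6]

for ratio in (0.0, 0.5, 0.75, 1.0):
    print(ratio, round(mixed_training_loss(natural, robust, ratio), 3))
```

A ratio of 0 recovers natural training and a ratio of 1 recovers the original defense, in which every training point contributes an adversarial example.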
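Table 5. Membership inference attacks against robust CIFAR10 classifiers with varying ratios of training points used to compute the robust loss.
Advtrain ratio  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
0  100%  95.01%  0%  0%  57.43%  50.85%
1/2  100%  87.78%  75.85%  43.23%  67.20%  66.36%
3/4  100%  86.68%  88.34%  45.66%  71.07%  72.22%
1  99.99%  87.25%  96.07%  46.59%  74.89%  75.65%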
The membership inference attack results are summarized in Table 5, where the first column lists the ratio of training points used for computing robust loss. We can see that as more training points are used for computing the robust loss, the membership inference accuracy increases, due to the larger gap between advtrain accuracy and advtest accuracy.
5.2.3. Privacy risk with model perturbation budget
Next, we explore the relationship between membership inference and the adversarial perturbation budget, which controls the maximum absolute value of adversarial perturbations during the robust training process.
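Table 6. Membership inference attacks against robust CIFAR10 classifiers trained with varying adversarial perturbation budgets.
Perturbation budget  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
2/255  100%  93.74%  99.99%  82.20%  64.48%  66.54%
4/255  100%  91.19%  99.89%  70.03%  69.44%  72.43%
8/255  99.99%  87.25%  96.07%  46.59%  74.89%  75.65%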
We performed robust training (Madry et al., 2018) for three CIFAR10 classifiers with varying adversarial perturbation budgets, and show the results in Table 6. Note that a model trained with a larger perturbation budget is more robust since it can defend against larger adversarial perturbations. From Table 6, we can see that more robust models leak more information about the training data. With a larger budget, the robust model relies on a larger perturbation ball around each training point, leading to a higher membership inference attack accuracy.
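Since the budget bounds the maximum absolute value of the perturbation, it can be enforced by a per-coordinate projection, as in the sketch below. This is a generic illustration of that projection step rather than the training code used in our experiments; the function name is hypothetical, and the assumption that pixel values lie in [0, 1] is made only because the budgets are expressed as fractions of 255.

```python
def project_linf(x_adv, x, eps, lo=0.0, hi=1.0):
    """Project x_adv so that every coordinate is within eps of x and in [lo, hi]."""
    projected = []
    for xa, xo in zip(x_adv, x):
        xa = min(max(xa, xo - eps), xo + eps)   # stay within eps of the original value
        projected.append(min(max(xa, lo), hi))  # stay a valid pixel value
    return projected


eps = 8 / 255
x = [0.5, 0.0, 1.0]          # original pixel values (toy example)
x_adv = [0.6, -0.2, 0.99]    # candidate adversarial values before projection

x_proj = project_linf(x_adv, x, eps)
print([round(v, 4) for v in x_proj])
```

A larger `eps` enlarges the set of points that the robust model must classify consistently with each training point, which is the ball referred to in the discussion above.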
5.2.4. Privacy risk with model capacity
Madry et al. (Madry et al., 2018) have observed that compared with natural training, robust training requires a significantly larger model capacity (e.g., deeper neural network architectures and more convolution filters) to obtain high robustness. In fact, we can think of the robust training approach as adding more “virtual training points”, which lie within the perturbation ball around the original training points. Thus the model capacity needs to be large enough to fit well on the larger “virtual training set”.
Here we investigate the influence of model capacity by varying the capacity scale of the wide ResNet architecture (Zagoruyko and Komodakis, 2016) used in CIFAR10 training, which is proportional to the output channel numbers of the residual layers. We perform membership inference attacks on the robust models and show the results in Figure 3. The attacks are based on benign inputs’ predictions, and the gray line measures the privacy leakage of the natural models as a baseline.
First, we can see that as the model capacity increases, the model has a higher membership inference accuracy, along with a higher adversarial train accuracy. Second, when using a larger adversarial perturbation budget , a larger model capacity is also needed. When , a capacity scale of 2 is enough to fit the training data, while for , a capacity scale of 8 is needed.
5.2.5. Inference attacks using targeted adversarial examples
Next, we investigate membership inference attacks using targeted adversarial examples. For each input, we compute 9 targeted adversarial examples, one for each of the 9 incorrect labels as the target, using Equation (19). We then compute the output prediction vectors for all adversarial examples and use the shadow-training inference method proposed by Shokri et al. (Shokri et al., 2017) to perform membership inference attacks. Specifically, for each class label, we learn a dedicated inference model (binary classifier) by using the output predictions of targeted adversarial examples from training points and test points as the training set for the membership inference. We then test the inference model on the remaining CIFAR10 training and test examples from the same class label. In our experiments, we use a 3-layer fully connected neural network with 200, 20, and 2 neurons in its layers, respectively. We call this method “modelinfer (targeted)”.
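The shape of the inference model's input and architecture can be sketched as follows. This is an illustration under stated assumptions: the 9 prediction vectors are simply concatenated into one feature vector, hidden layers use ReLU, and the final 2 neurons feed a softmax over {non-member, member}. The weights are randomly initialized and untrained, and all function names are hypothetical.

```python
import math
import random

NUM_CLASSES = 10
LAYER_SIZES = [200, 20, 2]  # layer widths stated in the text


def build_features(targeted_predictions):
    """Concatenate the 9 prediction vectors (one per incorrect target label)."""
    assert len(targeted_predictions) == NUM_CLASSES - 1
    return [p for vector in targeted_predictions for p in vector]


def init_network(input_dim, rng):
    """Random weights only; a real attack would train them on shadow data."""
    layers, fan_in = [], input_dim
    for fan_out in LAYER_SIZES:
        scale = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def forward(layers, features):
    """ReLU hidden layers followed by a softmax over the 2 output neurons."""
    h = features
    for i, (weights, bias) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Placeholder prediction vectors: 9 targeted examples, 10 class probabilities each.
predictions = [[1.0 / NUM_CLASSES] * NUM_CLASSES for _ in range(NUM_CLASSES - 1)]
features = build_features(predictions)
member_probs = forward(init_network(len(features), rng), features)
print(len(features), member_probs)
```

One such network would be trained per class label, so that each inference model only sees inputs that share the same ground-truth class.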
For untargeted adversarial examples or benign examples, a similar class label-dependent inference model can also be obtained by using either the untargeted adversarial example’s prediction vector or the benign example’s prediction vector as features of the inference model. We call these methods “modelinfer (untargeted)” and “modelinfer (benign)”. We use the same 3-layer fully connected neural network as the inference classifier.
Finally, we also adapt our confidence-thresholding inference strategy to be class label-dependent by choosing the confidence threshold according to prediction confidence values from training points and test points, and then testing on the remaining CIFAR10 points from the same class label. Depending on whether the confidence value comes from the untargeted adversarial input or the benign input, we call the method “confidenceinfer (untargeted)” or “confidenceinfer (benign)”.
Table 7. Class label-dependent membership inference attacks against the robust CIFAR10 classifier (Madry et al., 2018) using the five inference strategies.
Class label  confidenceinfer (benign)  modelinfer (benign)  confidenceinfer (untargeted)  modelinfer (untargeted)  modelinfer (targeted)
0  70.88%  71.49%  72.21%  72.70%  74.42%
1  63.57%  64.42%  67.52%  67.69%  68.88%
2  80.16%  76.74%  79.71%  80.16%  83.58%
3  90.43%  90.49%  87.64%  87.83%  90.57%
4  82.30%  82.17%  81.83%  81.57%  84.47%
5  81.34%  79.84%  81.57%  81.34%  83.02%
6  75.34%  70.92%  77.66%  76.97%  79.94%
7  69.54%  67.61%  72.92%  72.82%  72.98%
8  69.16%  69.57%  74.36%  74.40%  75.33%
9  68.13%  66.34%  71.86%  72.06%  73.32%
The membership inference attack results using the above five strategies are presented in Table 7. We can see that the inference strategy based on targeted adversarial examples, “modelinfer (targeted)”, always has the highest inference accuracy. This is because the targeted adversarial examples contain information about the distance of the input to each label’s decision boundary, while untargeted adversarial examples contain information about the distance of the input to only a nearby label’s decision boundary. Thus targeted adversarial examples leak more membership information. As an aside, we also find that our confidence-based inference methods obtain nearly the same inference results as the trained neural network inference models, showing the effectiveness of the confidence-thresholding inference strategies.
Table 8. Membership inference attacks against natural and verifiably robust models on the Yale Face dataset.
Training method  train acc  test acc  advtrain acc  advtest acc  vertrain acc  vertest acc  inference acc (benign)  inference acc (adversarial)  inference acc (verified)
Natural  100%  98.25%  4.95%  4.09%  N.A.  N.A.  55.85%  54.53%  N.A.
DualBased Verify (Wong and Kolter, 2018)  98.89%  92.80%  98.53%  83.66%  96.37%  68.87%  55.90%  60.06%  64.48%
AbsBased Verify (Mirman et al., 2018)  99.26%  83.27%  86.21%  50.97%  43.32%  18.09%  65.11%  65.70%  67.05%
IBPBased Verify (Gowal et al., 2018)  99.16%  85.80%  94.16%  69.26%  89.58%  36.77%  60.45%  66.25%  76.05%
Table 9. Membership inference attacks against natural and verifiably robust models on the FashionMNIST dataset.
Training method  train acc  test acc  advtrain acc  advtest acc  vertrain acc  vertest acc  inference acc (benign)  inference acc (adversarial)  inference acc (verified)
Natural  100%  92.18%  4.29%  4.11%  N.A.  N.A.  57.12%  50.99%  N.A.
DualBased Verify (Wong and Kolter, 2018)  75.13%  74.29%  65.78%  65.37%  61.77%  61.45%  50.58%  50.43%  50.45%
AbsBased Verify (Mirman et al., 2018)  86.44%  85.47%  74.22%  73.30%  69.69%  68.89%  50.79%  50.70%  50.59%
IBPBased Verify (Gowal et al., 2018)  89.85%  86.26%  82.72%  78.57%  79.20%  74.17%  52.13%  52.01%  52.67%
6. Membership Inference Attacks against Verifiably Robust Models
In this section we perform membership inference attacks against 3 verifiable defense methods: duality-based verification (DualBased Verify) (Wong and Kolter, 2018), abstract interpretation-based verification (AbsBased Verify) (Mirman et al., 2018), and interval bound propagation-based verification (IBPBased Verify) (Gowal et al., 2018). We train the verifiably robust models using the network architectures described in Section 4, with minor modifications for the DualBased Verify method (Wong and Kolter, 2018) as discussed in Appendix C. The perturbation budget is set to be for the Yale Face dataset and for the FashionMNIST dataset. We do not evaluate the verifiably robust models on the full CIFAR10 dataset, as none of these three defense methods scales to the wide ResNet architecture.
6.1. Overall Results
The membership inference attack results against natural and verifiably robust models are presented in Table 8 and Table 9, where “acc” stands for accuracy, “advtrain acc” and “advtest acc” measure adversarial accuracy under PGD attacks (Equation (9)), and “vertrain acc” and “vertest acc” report the verified worst-case accuracy under the perturbation constraint.
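To make the notion of verified worst-case accuracy concrete, the following is a minimal sketch of interval bound propagation, the bounding technique that gives IBPBased Verify its name. It is a generic illustration and not the exact procedure of any of the cited works; the tiny two-layer network, its weights, and the function names are hypothetical.

```python
def interval_linear(lower, upper, weights, bias):
    """Propagate elementwise bounds [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower and upper to upper;
            # a negative weight swaps the roles of the two bounds.
            lo += w * (l if w >= 0 else u)
            hi += w * (u if w >= 0 else l)
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def interval_relu(lower, upper):
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


def verified_correct(x, label, eps, layers):
    """True if the bounds prove that no perturbation within eps changes the label."""
    lo, hi = [v - eps for v in x], [v + eps for v in x]
    for i, (weights, bias) in enumerate(layers):
        lo, hi = interval_linear(lo, hi, weights, bias)
        if i < len(layers) - 1:
            lo, hi = interval_relu(lo, hi)
    # The true-class lower bound must beat every other class's upper bound.
    return all(lo[label] > h for k, h in enumerate(hi) if k != label)


# Hypothetical 2-2-2 network and input.
layers = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, -0.5]),
    ([[2.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
x, label = [0.5, 0.2], 0

small_budget_verified = verified_correct(x, label, 0.05, layers)
large_budget_verified = verified_correct(x, label, 0.10, layers)
print(small_budget_verified, large_budget_verified)
```

Verified accuracy is then the fraction of points for which such a check succeeds. Because the bounds are conservative, a point can fail the check even when no actual adversarial example exists, which is why verified accuracy is never higher than adversarial accuracy under PGD attacks.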
For the Yale Face dataset, all three defense methods leak more membership information. The IBPBased Verify method even reaches an inference accuracy of 76.05% (Table 8), higher than the inference accuracy of the empirical defenses shown in Table 2, resulting in a much larger membership inference advantage (Equation (16)) than that of the natural model. The inference strategy based on verified prediction confidence has the highest inference accuracy, as the verification process enlarges the gap in prediction confidence between training data and test data.
On the other hand, for the FashionMNIST dataset, we fail to obtain increased membership inference accuracies on the verifiably robust models. However, we also observe much reduced benign train accuracy (below 90%) and verified train accuracy (below 80%), which means that the model fits the training set poorly. Similar to our analysis of empirical defenses, we can think of the verifiable defense as adding more “virtual training points” around each training example to compute its verified robust loss. Since the verified robust loss is an upper bound on the real robust loss, the added “virtual training points” are in fact beyond the perturbation ball. Therefore, the model capacity needed for verifiable defenses is even larger than that of empirical defense methods.
From the experiment results in Section 5.2.4, we have shown that if the model capacity is not large enough, the robust model will not fit the training data well. This explains why membership inference accuracies for verifiably robust models are limited in Table 9. However, enlarging the model capacity does not guarantee that the training points will fit well for verifiable defenses because the verified upper bound of robust loss is likely to be looser with a deeper and larger neural network architecture. We validate our hypothesis in the following two subsections.
6.2. Varying Model Capacities
We use models with varying capacities to robustly train on the Yale Face dataset with the IBPBased Verify defense (Gowal et al., 2018) as an example.
We present the results in Figure 4, where model capacity scale of corresponds to the original model architecture, and we perform membership inference attacks based on verified worst-case prediction confidence. We can see that as the model capacity increases, robustness performance initially improves and membership inference accuracy rises with it. However, when the model capacity is too large, both the robustness performance and the membership inference accuracy begin to decrease, since the verified robust loss becomes too loose.
6.3. Reducing the Size of Training Set
Table 10. Membership inference attacks against natural and verifiably robust CIFAR10 classifiers trained on a subset of the training data.
Training method  Perturbation budget  train acc  test acc  advtrain acc  advtest acc  vertrain acc  vertest acc  inference acc (benign)  inference acc (adversarial)  inference acc (verified)
Natural  N.A.  99.83%  71.8%  N.A.  N.A.  N.A.  N.A.  71.50%  N.A.  N.A.
DualBased Verify (Wong and Kolter, 2018)  0.25/255  100%  73.1%  99.99%  69.8%  99.99%  68.18%  76.13%  76.16%  76.04%
DualBased Verify (Wong and Kolter, 2018)  0.5/255  99.98%  69.29%  99.98%  64.51%  99.97%  60.89%  77.06%  77.35%  77.09%
DualBased Verify (Wong and Kolter, 2018)  0.75/255  100%  65.25%  99.95%  59.46%  99.85%  54.71%  77.99%  78.50%  78.20%
DualBased Verify (Wong and Kolter, 2018)  1/255  99.78%  63.96%  99.44%  57.06%  98.61%  50.74%  76.30%  77.05%  77.16%
DualBased Verify (Wong and Kolter, 2018)  1.25/255  98.46%  61.79%  97.30%  53.76%  95.36%  46.70%  74.07%  75.10%  75.41%
DualBased Verify (Wong and Kolter, 2018)  1.5/255  96.33%  60.97%  94.27%  51.72%  90.19%  44.23%  71.08%  72.29%  72.69%
In this subsection, we further validate our hypothesis by showing that when the size of the training set is reduced so that the model can fit well on the reduced dataset, the verifiable defense method indeed leads to an increased membership inference accuracy.
We choose the duality-based verifiable defense method (Wong and Kolter, 2018; Wong et al., 2018) and train the CIFAR10 classifier with a normal ResNet architecture: 3 groups of residual layers with output channel numbers (16, 32, 64) and only 1 residual unit for each group. The whole CIFAR10 training set has too many points to be robustly fitted with the verifiable defense algorithm: the robust CIFAR10 classifier (Wong et al., 2018) with has the train accuracy below . Therefore, we select a subset of the training data to robustly train the model by randomly choosing () training images for each class label. We vary the perturbation budget value in order to observe when the model capacity is not large enough to fit on this partial CIFAR10 set using the verifiable training algorithm (Wong and Kolter, 2018).
We show the obtained results in Table 10, where the natural model has a low test accuracy (below 75%) and high privacy leakage (inference accuracy is 71.50%) since we only use a subset of the training examples to learn the classifier. By using the verifiable defense method (Wong and Kolter, 2018), the verifiably robust models have increased membership inference accuracy values for all perturbation budget values. We can also see that as the perturbation budget increases, the robust model initially becomes more and more susceptible to membership inference attacks (for example, the inference accuracy based on benign inputs increases from 76.13% to 77.99%). However, beyond a threshold of 0.75/255, the inference accuracy starts to decrease, since a higher perturbation budget requires a model with a larger capacity to fit well on the training data.
7. Discussion: Potential Countermeasures
In this section, we discuss potential countermeasures that can reduce the risk of membership inference attacks while maintaining model robustness.
7.1. Temperature Scaling
Our membership inference strategies leverage the difference between the prediction confidence of the target model on its training set and test set. Thus, a straightforward mitigation method is to reduce this difference by applying temperature scaling on logits (Guo et al., 2017). The temperature scaling method was shown by Shokri et al. (Shokri et al., 2017) to be effective in reducing privacy risk for natural (baseline) models; here we study its effect on robust models.
Temperature scaling is a post-processing calibration technique for machine learning models that divides the logits by a temperature $T$ before the softmax function. The model prediction probability for class $i$ can then be expressed as
$$F_i(x) = \frac{\exp\left(z_i(x)/T\right)}{\sum_{j} \exp\left(z_j(x)/T\right)}, \tag{21}$$
where $z_i(x)$ denotes the $i$-th logit for input $x$ and $T = 1$ corresponds to the original model prediction. By setting $T > 1$, the prediction confidence is reduced, and when $T \to \infty$, the prediction output is close to uniform and independent of the input, thus leaking no membership information while making the model useless for prediction.
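The effect of the temperature on prediction confidence can be checked with a short sketch. The function name and the logit values are hypothetical; the computation is the standard softmax applied to logits divided by the temperature.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]  # hypothetical logits of a 3-class model

confidences = {}
predicted_labels = {}
for T in (1.0, 5.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    confidences[T] = max(probs)
    predicted_labels[T] = probs.index(max(probs))
    print(f"T={T}: confidence={confidences[T]:.3f}, label={predicted_labels[T]}")
```

Dividing by a positive temperature preserves the ordering of the logits, so the predicted label and hence the classification accuracy are unchanged while the confidence shrinks toward the uniform value.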
We apply the temperature scaling technique on the robust Yale Face and FashionMNIST classifiers trained with the PGD-based adversarial training defense (Madry et al., 2018) and investigate its effect on membership inference. We present the membership inference results for varying temperature values (while maintaining the same classification accuracy) in Figure 5. We can see that increasing the temperature value decreases the membership inference accuracy.
7.2. Regularization to Improve Robustness Generalization
Regularization techniques such as parameter norm penalties and dropout (Srivastava et al., 2014) are typically used during the training process to address overfitting in machine learning models. Shokri et al. (Shokri et al., 2017) and Salem et al. (Salem et al., 2019) validate their effectiveness against membership inference attacks. Furthermore, Nasr et al. (Nasr et al., 2018) propose to measure the performance of the membership inference attack at each training step and use the measurement as a new regularizer.
The above mitigation strategies apply to both natural and robust machine learning models. For robust models, we can also rely on regularization approaches that improve the model’s robustness generalization. This can mitigate membership inference attacks, since poor robustness generalization leads to a severe privacy risk. We study the method proposed by Song et al. (Song et al., 2019) to improve the model’s robustness generalization and explore its performance against membership inference attacks.
The regularization method in (Song et al., 2019) performs domain adaptation (DA) (Torralba et al., 2011) for the benign examples and adversarial examples on the logits: two multivariate Gaussian distributions for the logits of benign examples and adversarial examples are computed, and distances between the two mean vectors and the two covariance matrices are added into the training loss.
Table 11. Membership inference attacks against robust models (Madry et al., 2018) with and without DA regularization (Song et al., 2019).
Dataset  using DA (Song et al., 2019)?  train acc  test acc  advtrain acc  advtest acc  inference acc (benign)  inference acc (adversarial)
Yale Face  no  99.89%  96.69%  99.05%  77.63%  61.69%  68.86%
Yale Face  yes  99.32%  94.75%  99.26%  88.52%  60.73%  63.14%
FashionMNIST  no  99.93%  90.88%  97.53%  68.74%  58.32%  64.42%
FashionMNIST  yes  88.97%  86.98%  81.59%  78.65%  51.19%  51.49%
We apply this DA-based regularization approach on the PGD-based adversarial training defense (Madry et al., 2018) to investigate its effectiveness against membership inference attacks. We list the experimental results both with and without the use of DA regularization for the Yale Face and FashionMNIST datasets in Table 11. We can see that the DA-based regularization decreases the gap between adversarial train accuracy and adversarial test accuracy (the robust generalization error), leading to a reduction in membership inference risk.
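The structure of the DA regularizer described above can be sketched as follows. This is an illustration under stated assumptions and not the implementation of Song et al. (2019): the distance is taken to be an l1 distance, the mean and covariance terms are summed with equal weight, and the covariance divides by the number of samples. The function names and the toy logits are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Covariance (dividing by n) of a list of equal-length vectors."""
    n = len(rows)
    mu = mean_vector(rows)
    centered = [[v - m for v, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    return [
        [sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
        for i in range(dim)
    ]


def da_penalty(benign_logits, adv_logits):
    """Distance between Gaussian fits of benign and adversarial logits.

    Assumption: l1 distances between means and between covariances,
    added with equal weight.
    """
    mu_b, mu_a = mean_vector(benign_logits), mean_vector(adv_logits)
    cov_b, cov_a = covariance_matrix(benign_logits), covariance_matrix(adv_logits)
    mean_dist = sum(abs(a - b) for a, b in zip(mu_a, mu_b))
    cov_dist = sum(
        abs(a - b)
        for row_a, row_b in zip(cov_a, cov_b)
        for a, b in zip(row_a, row_b)
    )
    return mean_dist + cov_dist


# Toy 2-class logits for three benign inputs and their adversarial examples.
benign = [[2.0, 0.0], [4.0, 1.0], [3.0, -1.0]]
adversarial = [[1.0, 0.5], [3.0, 1.5], [2.0, -0.5]]  # benign logits shifted by (-1, +0.5)

penalty = da_penalty(benign, adversarial)
print(penalty)
```

During training, such a penalty would be added to the robust training loss with some weight, pushing the logit statistics of adversarial examples toward those of benign examples.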
8. Conclusions
In this paper, we have connected the security domain and the privacy domain for machine learning systems by investigating the membership inference privacy risk of robust training approaches (that mitigate adversarial examples). To evaluate the membership inference risk, we propose two new inference methods that exploit structural properties of adversarially robust defenses, beyond the conventional inference method based on the prediction confidence of the benign input. By measuring the success of membership inference attacks on robust models trained with six state-of-the-art adversarial defense approaches, we find that all six robust training methods make the machine learning model more susceptible to membership inference attacks, compared to natural (undefended) training. Our analysis further reveals that the privacy leakage is related to the target model’s robustness generalization, its adversarial perturbation constraint, and its capacity. We also study the effect of potential countermeasures against membership inference attacks, including temperature scaling and regularization to improve robustness generalization. The detailed analysis in our paper highlights the importance of thinking about security and privacy together. Specifically, the membership inference risk needs to be considered when designing approaches to defend against adversarial examples.
9. Acknowledgments
This work was supported in part by the National Science Foundation under grants CNS-1553437, CNS-1704105, CIF-1617286 and EARS-1642962, by the Office of Naval Research Young Investigator Award, by the Army Research Office Young Investigator Prize, by Faculty research awards from Intel and IBM, and by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative.
References
 Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In ACM Conference on Computer and Communications Security (CCS). 308–318.
 Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2442–2452.
 Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML).
 Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 387–402.

 Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning attacks against support vector machines. In International Conference on Machine Learning (ICML). 1467–1474.
 Biggio and Roli (2018) Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition 84 (2018), 317–331.
 Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P). 39–57.
 Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
 Deng et al. (2013) Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8599–8603.
 Fredrikson et al. (2015) Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In ACM Conference on Computer and Communications Security (CCS). 1322–1333.
 Fredrikson et al. (2014) Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing. In USENIX Security Symposium. 17–32.
 Ganju et al. (2018) Karan Ganju, Qi Wang, Wei Yang, Carl A Gunter, and Nikita Borisov. 2018. Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In ACM Conference on Computer and Communications Security (CCS). 619–633.
 Gehr et al. (2018) Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. 2018. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (S&P). 3–18.
 Georghiades et al. (2001) Athinodoros S Georghiades, Peter N Belhumeur, and David J Kriegman. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2001), 643–660.
 Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR).
 Gowal et al. (2018) Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. In NeurIPS Workshop on Security in Machine Learning (SECML).
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
 Hayes et al. (2018) J Hayes, L Melis, G Danezis, and E De Cristofaro. 2018. LOGAN: Membership Inference Attacks Against Generative Models. In Proceedings on Privacy Enhancing Technologies (PoPETs).
 Hayes and Ohrimenko (2018) Jamie Hayes and Olga Ohrimenko. 2018. Contamination Attacks and Mitigation in Multi-Party Machine Learning. In Conference on Neural Information Processing Systems (NeurIPS). 6602–6614.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
 Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
 Huang et al. (2011) Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. 2011. Adversarial machine learning. In ACM Workshop on Artificial Intelligence and Security (AISec). 43–58.
 Jagielski et al. (2018) Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. 2018. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In IEEE Symposium on Security and Privacy (S&P).
 Kesarwani et al. (2018) Manish Kesarwani, Bhaskar Mukhoty, Vijay Arya, and Sameep Mehta. 2018. Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC). ACM, 371–380.
 Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML). 1885–1894.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NeurIPS). 1097–1105.
 Lee et al. (2005) Kuang-Chih Lee, Jeffrey Ho, and David J Kriegman. 2005. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis & Machine Intelligence 5 (2005), 684–698.
 Lee et al. (2019) Taesung Lee, Benjamin Edwards, Ian Molloy, and Dong Su. 2019. Defending Against Model Stealing Attacks Using Deceptive Perturbations. In Deep Learning and Security Workshop (DLS).
 Long et al. (2017) Yunhui Long, Vincent Bindschaedler, and Carl A Gunter. 2017. Towards Measuring Membership Privacy. arXiv preprint arXiv:1712.09136 (2017).
 Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR).
 Mirman et al. (2018) Matthew Mirman, Timon Gehr, and Martin Vechev. 2018. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML). 3575–3583.
 Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 6337 (2017), 508–513.
 Nasr et al. (2018) Milad Nasr, Reza Shokri, and Amir Houmansadr. 2018. Machine learning with membership privacy using adversarial regularization. In ACM Conference on Computer and Communications Security (CCS).
 Nasr et al. (2019) Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. In IEEE Symposium on Security and Privacy (S&P).
 Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P). 372–387.
 Papernot et al. (2018) Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. 2018. SoK: Security and Privacy in Machine Learning. In IEEE European Symposium on Security and Privacy (EuroS&P). 399–414.
 Raghunathan et al. (2018) Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. 2018. Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR).
 Salem et al. (2019) Ahmed Salem, Yang Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Network and Distributed Systems Security Symposium (NDSS).
 Schmidt et al. (2018) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. 2018. Adversarially Robust Generalization Requires More Data. In Conference on Neural Information Processing Systems (NeurIPS).
 Shafahi et al. (2018) Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. 2018. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. In Conference on Neural Information Processing Systems (NeurIPS).
 Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In ACM Conference on Computer and Communications Security (CCS). 1310–1321.
 Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (S&P). 3–18.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
 Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
 Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. 2018. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations (ICLR).
 Song et al. (2019) Chuanbiao Song, Kun He, Liwei Wang, and John E Hopcroft. 2019. Improving the Generalization of Adversarial Training with Domain Adaptation. In International Conference on Learning Representations (ICLR).
 Song et al. (2017) Congzheng Song, Thomas Ristenpart, and Vitaly Shmatikov. 2017. Machine learning models that remember too much. In ACM Conference on Computer and Communications Security (CCS). 587–601.
 Song et al. (2019) Liwei Song, Reza Shokri, and Prateek Mittal. 2019. Membership inference attacks against adversarially robust deep learning models. In Deep Learning and Security Workshop (DLS).
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
 Steinhardt et al. (2017) Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. 2017. Certified defenses for data poisoning attacks. In Conference on Neural Information Processing Systems (NeurIPS). 3517–3529.
 Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
 Torralba et al. (2011) Antonio Torralba, Alexei A Efros, et al. 2011. Unbiased look at dataset bias. In IEEE conference on computer vision and pattern recognition (CVPR).
 Tramèr et al. (2016) Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. 2016. Stealing Machine Learning Models via Prediction APIs. In USENIX Security Symposium. 601–618.

 Tsipras et al. (2019) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2019. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR).
 Wang and Gong (2018) Binghui Wang and Neil Zhenqiang Gong. 2018. Stealing hyperparameters in machine learning. In IEEE Symposium on Security and Privacy (S&P).
 Wong and Kolter (2018) Eric Wong and Zico Kolter. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML). 5283–5292.
 Wong et al. (2018) Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. 2018. Scaling provable adversarial defenses. In Conference on Neural Information Processing Systems (NeurIPS).
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
 Yeom et al. (2018) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. Privacy risk in machine learning: Analyzing the connection to overfitting. In IEEE Computer Security Foundations Symposium (CSF). 268–282.
 Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC).
 Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. 2019. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML).
Appendix A Fine-Grained Analysis of Prediction Loss of the Robust CIFAR-10 Classifier
Here, we perform a fine-grained analysis of Figure 0(a) by separately visualizing the prediction loss distributions for test points that are secure and test points that are insecure. A point is deemed secure when the model classifies it correctly for all adversarial perturbations within the perturbation constraint.
Note that only a few training points were not secure, so we focused our fine-grained analysis on the test set. Figure 6 shows that insecure test inputs are very likely to have a large prediction loss (a low confidence value). Our membership inference strategies directly use the confidence to determine membership, so the privacy risk is strongly related to robustness generalization, even when we rely purely on the prediction confidence of the benign, unmodified input.
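The relationship described above can be illustrated with a small sketch. It assumes a simple threshold rule on the confidence of the benign input as the membership inference strategy; the function names, the threshold value, and the toy confidence values are all hypothetical and are not taken from the paper's experiments.

```python
def split_by_security(values, is_secure):
    """Split per-example values into (secure, insecure) lists using a boolean mask."""
    secure = [v for v, s in zip(values, is_secure) if s]
    insecure = [v for v, s in zip(values, is_secure) if not s]
    return secure, insecure


def infer_membership(confidences, threshold):
    """Guess 'member' when the benign-input confidence is at least the threshold."""
    return [c >= threshold for c in confidences]


def inference_accuracy(train_conf, test_conf, threshold):
    """Average of the member hit rate and the non-member rejection rate."""
    member_rate = sum(infer_membership(train_conf, threshold)) / len(train_conf)
    rejected = [not g for g in infer_membership(test_conf, threshold)]
    nonmember_rate = sum(rejected) / len(test_conf)
    return 0.5 * (member_rate + nonmember_rate)


# Toy values (assumed for illustration): training points are confidently
# predicted, while insecure test points have low confidence.
train_conf = [0.99, 0.97, 0.95, 0.98]
test_conf = [0.96, 0.40, 0.93, 0.30]
test_is_secure = [True, False, True, False]

secure_conf, insecure_conf = split_by_security(test_conf, test_is_secure)
acc = inference_accuracy(train_conf, test_conf, threshold=0.9)
```

In this toy setting the threshold rule only rejects the insecure test points, which mirrors the observation that low-confidence insecure inputs are the ones that separate test data from training data.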
Appendix B Model Architecture
We present the detailed neural network architectures in Table 12.
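| Yale Face | Fashion-MNIST | CIFAR-10 |
|---|---|---|
| Conv | Conv | Conv |
| Conv | Conv | Res 1605 |
| Conv | Conv | Res 3205 |
| Conv | Conv | Res 6405 |
| Conv | Conv | FC 200 |
| Conv | Conv | FC 10 |
| Conv | FC 200 | |
| Conv | FC 10 | |
| FC 200 | | |
| FC 38 | | |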
Appendix C Experiment Modifications for the Duality-Based Verifiable Defense
When dealing with the duality-based verifiable defense method (Wong and Kolter, 2018; Wong et al., 2018) (implemented in PyTorch), we find that the convolution configuration (kernel size and stride) described in Section 4 is not applicable. The defense method works by backpropagating through the neural network to express the dual problem, and that configuration prohibits the backpropagation analysis because the output-size computation is not divisible by 2 (PyTorch uses a round-down operation). For the same reason, we also need to change the dimension of the Yale Face input by adding zero padding. In our experiments, we have validated that the natural models trained with the above modifications have similar accuracy and privacy performance to the natural models without modifications reported in Table 8 and Table 9.
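The round-down issue can be made concrete with the standard output-size formula for a convolution with dilation 1, floor((n + 2p - k) / s) + 1. The sketch below is only illustrative: the input size, kernel sizes, strides, and padding values are hypothetical and are not the values used in the paper, and the helper names are assumptions.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1 (dilation 1)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def needs_rounding(input_size, kernel_size, stride, padding=0):
    """True when (n + 2p - k) is not a multiple of the stride, so floor discards a remainder."""
    return (input_size + 2 * padding - kernel_size) % stride != 0


def extra_zeros_for_exact_division(input_size, kernel_size, stride):
    """Smallest number of extra zero pixels along one axis that makes the division exact."""
    extra = 0
    while (input_size + extra - kernel_size) % stride != 0:
        extra += 1
    return extra


# Hypothetical configuration where the division by the stride is not exact.
rounded = needs_rounding(input_size=28, kernel_size=3, stride=2)
# Hypothetical alternative configuration where the division is exact.
exact = not needs_rounding(input_size=28, kernel_size=4, stride=2, padding=1)
pad = extra_zeros_for_exact_division(input_size=28, kernel_size=3, stride=2)
```

When `needs_rounding` returns True, the forward pass silently drops part of the input border, which is the kind of mismatch that changing the kernel configuration or zero-padding the input avoids.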