Deep generative models have been widely studied recently as an effective way to abstract and approximate a data distribution. Two of the most commonly used and efficient approaches are Variational Autoencoders (VAEs) [Kingma and Welling2013a] and Generative Adversarial Networks (GANs) [Goodfellow et al.2014a]. VAEs consist of a pair of networks, the encoder and the decoder, and explicitly maximize the variational lower bound with respect to the joint generative and inference models. GANs take a game-theoretic approach and try to achieve a good balance between two strategic agents, the generator and discriminator networks. These generative models have been applied to complicated real-world data in a variety of applications: graphics, geometric modeling, and even the design of cryptographic primitives [Abadi and Andersen2016].
A successful machine learning model captures information from the training dataset. At the same time, an adversary may gain knowledge about the training dataset by interacting with the model. This information leakage problem is formalized as the membership inference problem [Shokri et al.2016, Yeom, Fredrikson, and Jha2017]: given a model and a data point, determine whether the data point was in the training dataset of the model. This can be considered an attack on the privacy of the training dataset. A successful membership attack on a machine learning model means that the privacy of the training data was not sufficiently protected when the trained model was released.
In the literature, most membership attacks were designed for classifiers. Fredrikson et al. [Fredrikson et al.2014] proposed to infer sensitive features of input data by actively probing the outputs. Later, a more robust inversion attack was developed [Fredrikson, Jha, and Ristenpart2015], in which attackers can recover part of the training set, such as human faces. Shokri et al. [Shokri et al.2017] tried to predict whether a data point belongs to the training set. More recently, a GAN-based attack has been proposed against collaborative deep learning [Hitaj, Ateniese, and Perez-Cruz2017] in distributed machine learning systems, in which users collaboratively train a model by sharing gradients of their locally trained models through a parameter server. Given that Google has proposed federated learning based on distributed machine learning [McMahan and Ramage2017] and has already deployed it to mobile devices, such a GAN-based attack poses a severe privacy threat.
Not much has been done on membership attacks against generative models. Hayes et al. [Hayes et al.2017] considered attacking GANs and ranked the target data by how likely each instance is to be in the training set, but their method cannot attack individual instances. In DefenseGAN [Samangouei, Kabkab, and Chellappa2018], a defense/attack method called Direct Projection was introduced, in which the input to the GAN is treated as a trainable variable and gradient descent is used to minimize the difference between the generated data and the target data.
Our Contribution. In this article, we propose new membership attacks and new attack methods for generative models. First, we consider both single membership attacks and co-membership attacks. In a single membership attack, we are given a single instance and ask whether it was used in training. In a co-membership attack (co-attack), we are given multiple instances with the knowledge that either all of them were used in training or none was. The co-membership attack model arises in many real-world scenarios, as a user often contributes multiple data points to a training task: for example, multiple pictures uploaded from a smartphone, or multi-day data from a smart meter.
We propose a new attack method. Given a generative model with a target (or targets in a co-attack), we optimize another neural network to produce an input to the generative network such that the generated output nearly matches the target (or reproduces each of the targets). If we are able to reproduce elements of the training data but unable to reproduce samples outside it, the membership attack is successful. Note that this is completely unsupervised and the attacker network starts from random values. Compared to the technique used in DefenseGAN, our method works well with co-membership attacks, as knowledge from multiple target instances can be extracted and shared through the attacker network, while the direct projection method cannot share information across different target instances.
We evaluated our new attack methods on both VAEs and GANs using a variety of datasets (MNIST, Fashion-MNIST, and CIFAR-10) and a range of settings (different training set sizes, architectures, numbers of training iterations, etc.). Our observations are:
- Co-membership attacks are more powerful than single attacks.
- Our neural-network-based attacker is more powerful than other optimization methods that use gradient descent directly in the latent space (e.g., DefenseGAN).
- VAEs are in general more susceptible to membership attacks than GANs.
In addition, we relate the success of membership attacks to a few other quality measures for generative models. Generalization is arguably the most important objective for a machine learning model: the model should produce useful results not only on the training data but also on data it has not seen. For supervised learning, the generalization error is the difference between the classification error on the training set and the error on the underlying joint probability distribution. For generative models, there is no standard measure of generalization yet. Following a similar intuition, we measure the generalization gap as the difference between the reconstruction error of the membership attacker on the training data and on the test data. Secondly, diversity describes whether the generator can produce different samples [Arora, Risteski, and Zhang2018]. We use a measure called dispersion to evaluate diversity. Dispersion with a size parameter is defined as the largest value of the closest pairwise distance among that many points in the generated data. We show empirically that dispersion, generalization gap, and the success rate of membership attacks are all closely correlated. When the training data are sampled at random, the more training data we use, the smaller the generalization gap and the higher the dispersion of the generated data.
Last, we show a distinct difference between diversity and generalization. We use our membership attack to choose adversarial samples, that is, samples that the attacker network finds hard to reproduce during the training phase. When we train a new GAN using only these adversarial samples and compare it with a GAN trained on the same number of random samples, we observe a sharp contrast between diversity and generalization. The GAN trained with adversarial sampling shows much higher dispersion, but it has a large generalization gap and membership attacks against it are easy to launch. This shows that high diversity of a generative model does not by itself guarantee good quality and generalization. Adversarial samples focus too much on corner cases and therefore may not effectively approximate the real distribution.
2 Membership Attacks Against Deep Generative Models
In this section, we introduce an efficient and unsupervised membership attack against different deep generative models, and show that the generative models can indeed overfit and reveal the training data.
Let $x$ be a data instance of dimension $d$, and let $P_x$ be the distribution from which $x$ is sampled, so $x \sim P_x$. The objective of GANs [Goodfellow et al.2014b] is to learn the distribution $P_x$; in practice, GANs are used to generate new samples that approximate this distribution. To achieve this goal, a GAN consists of two components, the generator ($G_\theta$) and the discriminator ($D_\phi$). $G_\theta$ produces a sample, and $D_\phi$ tries to distinguish how likely the sample is to come from the output distribution of the generator $G_\theta$, where $\theta$ and $\phi$ are neural network parameters. To generate a new data point, the generator takes a random $l$-dimensional vector $z \sim P_z$ and returns $G_\theta(z)$. The discriminator is a function $D_\phi : \mathbb{R}^d \to [0, 1]$. The output of $D_\phi$ is interpreted as the probability that the data is drawn from $P_x$. The objective of a GAN is
$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim P_x}\big[\log D_\phi(x)\big] + \mathbb{E}_{z \sim P_z}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big].$$
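In Wasserstein GANs (WGANs) [Arjovsky, Chintala, and Bottou2017], the measuring function is the identity function instead of the $\log$ function, and the resulting objective, with generator $G_\theta$ and critic $D_\phi$, is
$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim P_x}\big[D_\phi(x)\big] - \mathbb{E}_{z \sim P_z}\big[D_\phi(G_\theta(z))\big].$$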
In that paper and in our implementation, the Lipschitz constraint on the critic is enforced by weight clipping. Empirically, WGANs are observed to behave better than their vanilla counterparts, so we use WGANs in our experiments. We also test a recent variant of WGANs [Gulrajani et al.2017], denoted iWGANs, which optimizes the same objective but enforces the Lipschitz condition by penalizing the norm of the gradient of the critic.
To perform a membership attack against the generator $G_\theta$ of a trained GAN for a given target instance $x$, we introduce an attacker $A$ that synthesizes a seed for $G_\theta$ so as to generate an instance close to $x$. The pipeline of the membership attack is shown in Figure 1. Specifically, the attacker is a neural network $A_\gamma$, parameterized by $\gamma$, which takes $x$ as input and maps $\mathbb{R}^d \to \mathbb{R}^l$. Here $d$ is the data dimension and $l$ is the input dimension of the generator. The objective of the attacker is to minimize a distance $\Delta$ (the reconstruction loss of the attacker) between the data point $x$ and the generated data $G_\theta(A_\gamma(x))$:
$$\min_\gamma \; \Delta\big(x, G_\theta(A_\gamma(x))\big). \tag{1}$$
The result of optimization problem 1 is used directly to determine the membership of the target instance. Intuitively, a smaller reconstruction error indicates that the target is more likely to be from the training data. Note that the parameters of the attacker network are randomly initialized for each new attack, so the attack is unsupervised and requires no pre-training.
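As a rough illustration of objective 1, the sketch below runs a single attack against a toy generator in plain Python. The generator `generate`, the linear attacker, the learning rate, and the step count are all assumptions made for this example, not the networks used in the experiments. It only shows the mechanism: a target that the generator can reproduce ends with a far smaller reconstruction loss than a target it cannot.

```python
import random

def generate(z):
    # Toy stand-in for a trained generator: maps a scalar seed to a 2-D point on a line.
    return [z, 2.0 * z]

def sq_l2(a, b):
    # Squared L2 distance, used here as the reconstruction loss.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def single_attack(x, steps=500, lr=0.01, seed=0):
    """Fit a linear attacker A(x) = w . x so that generate(A(x)) is close to x."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in x]  # random initialisation, no pre-training
    for _ in range(steps):
        z = sum(wi * xi for wi, xi in zip(w, x))
        g = generate(z)
        # White-box gradient of the loss w.r.t. the seed z (chain rule through generate).
        dloss_dz = 2.0 * (g[0] - x[0]) + 4.0 * (g[1] - x[1])
        # dz/dw_i = x_i, so the gradient w.r.t. each attacker weight is dloss_dz * x_i.
        w = [wi - lr * dloss_dz * xi for wi, xi in zip(w, x)]
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sq_l2(x, generate(z))

loss_reproducible = single_attack([1.0, 2.0])     # lies on the generator's output line
loss_unreproducible = single_attack([2.0, -1.0])  # orthogonal to that line
```

In this toy setting the first loss converges to essentially zero while the second stays bounded away from zero, which is the gap the threshold test exploits.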
The L2 distance is used as our distance function throughout the paper. We remark that the proposed attack method is not specific to the L2 distance; for other datasets, a different application-oriented metric may be used.
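VAEs [Kingma and Welling2013b] are good at changing or exploring variations on existing data in a desired, specific direction. VAEs consist of a pair of connected networks, an encoder and a decoder. The encoder converts an input into a dense representation, while the decoder converts that representation back to the original input. The encoder takes input data $x$ and outputs two vectors: a vector of means $\mu$ and a vector of standard deviations $\sigma$. With re-parameterization, the seed $z$ is obtained by sampling $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The decoder $G_\theta$ takes $z$ and generates a data point $G_\theta(z)$. The entire network is trained to maximize the variational lower bound
$$\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\big),$$
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL-divergence between two distributions.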
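For concreteness, the re-parameterization step and the KL term can be sketched as follows. This assumes the common VAE setup of a diagonal Gaussian posterior and a standard normal prior, for which the KL-divergence has a closed form; the function names are illustrative and the sketch is not taken from the authors' implementation.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1), applied per coordinate.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    # Closed form of KL( N(mu, diag(sigma^2)) || N(0, I) ).
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])     # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])  # mean shifted by one unit
```

The KL term vanishes when the posterior equals the prior and grows as the encoder output moves away from it, which is what regularizes the latent space.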
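Similarly, when conducting a membership attack against a VAE, we search for a particular seed $z = A_\gamma(x)$ that reproduces the target image $x$ when $z$ is fed to the decoder $G_\theta$. The objective of the attacker is again:
$$\min_\gamma \; \Delta\big(x, G_\theta(A_\gamma(x))\big). \tag{2}$$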
In both optimization objectives 1 and 2, the attacker network is trying to invert the generator. The generator takes a seed $z$ and outputs $x$, and the attacker takes $x$ and looks for $z$. In a VAE, if the compact representation produced by the encoder has the same dimension as the seed, the attacker is very similar to the encoder.
White-box vs. Black-box.
If the internal structure of the generator (decoder) is exposed (i.e., in a white-box attack), we can compute the analytical gradient of the distance with respect to the attacker parameters. Otherwise, the attack is called a black-box attack, and we need to use finite differences to approximate the gradient. In this case, the optimization of the attacker requires more black-box accesses to the generator (decoder). In this work, we focus on the white-box setting, following Kerckhoffs's principle [Shannon1949], to explore what a powerful adversary can do and to better motivate defense methods.
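The black-box case can be sketched as follows. The central-difference scheme, the step size `h`, and the toy `black_box_generator` are assumptions made for illustration; the exact finite-difference formula used in the original is not recoverable from the text. The sketch also makes the query cost visible: each gradient estimate needs two generator calls per seed coordinate.

```python
def finite_difference_grad(loss_fn, z, h=1e-4):
    """Central-difference estimate of d loss / d z using only black-box evaluations."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += h
        z_minus = list(z)
        z_minus[i] -= h
        grad.append((loss_fn(z_plus) - loss_fn(z_minus)) / (2.0 * h))
    return grad

def black_box_generator(z):
    # Hypothetical generator that can be queried but not differentiated.
    return [z[0] + z[1], z[0] * z[1]]

target = [3.0, 2.0]

def loss(z):
    # Squared L2 reconstruction loss between the generator output and the target.
    out = black_box_generator(z)
    return sum((o - t) ** 2 for o, t in zip(out, target))

seed = [1.0, 1.0]
grad_estimate = finite_difference_grad(loss, seed)
queries_per_gradient = 2 * len(seed)  # two generator calls per seed coordinate
```

The estimate can then be pushed through the attacker network by the chain rule, exactly as the analytical gradient would be in the white-box case.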
Single Attack vs. Co-Attack.
The attack framework introduced above can launch a membership attack against a single target $x$ by optimizing one instance of the attacker network $A_\gamma$. This is called a single attack. If the attacker has more information about several target instances (for example, the targets are known to be either all from the training data or all from the testing data), we can co-optimize one single $A_\gamma$ to co-attack the related instances at the same time, instead of initializing a new attacker for each target instance. The information from multiple instances is fused together to guide one attacker network $A_\gamma$. This is termed a co-attack with strength $n$ if $n$ target instances $x_1, \dots, x_n$ are handled together with the prior knowledge that they have the same label. The new attacker loss is defined as the average of the reconstruction losses of the $n$ instances, so the objective of such a co-attacker, with generator (decoder) $G_\theta$ and distance $\Delta$, is
$$\min_\gamma \; \frac{1}{n} \sum_{i=1}^{n} \Delta\big(x_i, G_\theta(A_\gamma(x_i))\big).$$
Without modeling the attacker as a neural network, such a co-attack would be difficult. We observe in the experiments that the proposed co-attack becomes significantly more successful across models and datasets as the co-attack strength increases. This shows that co-attackers can leverage the shared information effectively.
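A minimal sketch of the co-attack loss, again on a toy generator: one set of attacker weights `w` is shared by all targets, and the loss is the average reconstruction loss over the tuple. The generator, the linear attacker, the zero initialization (used here only to keep the example deterministic, whereas the attack described above starts from random values), and the hyperparameters are assumptions for illustration.

```python
def generate(z):
    # Toy stand-in for a trained generator (an assumption for this sketch).
    return [z, 2.0 * z]

def co_attack_loss(w, targets):
    # Average reconstruction loss of one shared attacker over all targets.
    total = 0.0
    for x in targets:
        g = generate(w[0] * x[0] + w[1] * x[1])
        total += (g[0] - x[0]) ** 2 + (g[1] - x[1]) ** 2
    return total / len(targets)

def co_attack(targets, steps=2000, lr=0.01):
    w = [0.0, 0.0]  # a single linear attacker shared by every target
    n = len(targets)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in targets:
            g = generate(w[0] * x[0] + w[1] * x[1])
            dloss_dz = 2.0 * (g[0] - x[0]) + 4.0 * (g[1] - x[1])
            # Gradients from all targets accumulate into the same weights.
            grad[0] += dloss_dz * x[0] / n
            grad[1] += dloss_dz * x[1] / n
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return co_attack_loss(w, targets)

members = [[1.0, 2.0], [0.5, 1.0], [-1.0, -2.0]]      # all reproducible by generate
non_members = [[2.0, -1.0], [1.0, -0.5], [-2.0, 1.0]]  # none reproducible
member_loss = co_attack(members)
non_member_loss = co_attack(non_members)
```

Because every target contributes to the same weights, what the attacker learns from one instance carries over to the others, which a per-target seed search cannot do.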
3 Experimental Results
We tested our attacks on three datasets: MNIST [LeCun et al.1998], Fashion-MNIST [Xiao, Rasul, and Vollgraf2017], and CIFAR-10 [Krizhevsky2009]. L2 regularization is applied to weights and biases. The Adam optimizer is used, with separate learning rates for the generator and the discriminator. A gradient penalty is applied as in [Gulrajani et al.2017] for iWGANs. The network architectures and running environments can be found in the supplementary materials. We use the same number of parameters and a similar network structure for the different generative models.
Membership attack - MNIST
After a GAN is trained, we launch single membership attacks independently against randomly chosen training and non-training images. The attacker loss on each image is recorded. If the loss is smaller than a threshold, the image is declared to be from the training dataset. The receiver operating characteristic (ROC) curve of this binary classifier, obtained by varying the discriminating threshold, is plotted in Figure 2.
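The thresholding procedure can be sketched in a few lines of Python. The loss values below are hypothetical numbers chosen for illustration, not measurements from the experiments; the sketch sweeps the threshold over all observed losses and integrates the resulting ROC curve with the trapezoidal rule.

```python
def roc_curve(train_losses, nontrain_losses):
    """Sweep the threshold; an image is declared 'training' when its loss <= threshold."""
    thresholds = sorted(set(train_losses) | set(nontrain_losses))
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(l <= t for l in train_losses) / len(train_losses)
        fpr = sum(l <= t for l in nontrain_losses) / len(nontrain_losses)
        points.append((fpr, tpr))
    return points

def auc(points):
    # Trapezoidal area under the (false positive rate, true positive rate) curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

train_losses = [0.1, 0.2, 0.3, 0.9]     # hypothetical attacker losses on training images
nontrain_losses = [0.5, 0.8, 1.0, 1.2]  # hypothetical attacker losses on held-out images
points = roc_curve(train_losses, nontrain_losses)
area = auc(points)
```

An area close to 1 means the two loss populations are well separated; an area close to 0.5 corresponds to a curve along the diagonal, where the attack is no better than guessing.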
From Figure 2, we observe that the ROC curves for GANs trained with different training set sizes vary significantly. If the training dataset consists of only hundreds of images, the attacker can easily find a good seed for the generator to reproduce the training images. Thus, the optimization of the attacker network behaves drastically differently depending on whether it faces a training or a non-training image, which makes the simple binary classifier highly effective. The ROC curve indicates that the attacker can reach a detection rate of around 80% without any false positives. However, when the training dataset size increases to thousands (orange and yellow curves), this difference vanishes, and the ROC curves are very close to the diagonal.
For reference, we compare our method with two other unsupervised methods. The first is a naive baseline, named Nearest Neighbor, in which we simply compare the given instance with its nearest neighbor in a set of generated data (with the number of generated images chosen such that the computational cost is comparable to the other methods). The minimum distance is taken as the attacker loss of this baseline. The second, motivated by DefenseGAN [Samangouei, Kabkab, and Chellappa2018] and called Direct Projection, takes the noise $z$ as a trainable variable and uses gradient descent to adjust it directly under the new objective $\min_z \Delta(x, G_\theta(z))$, where $x$ is the target, $G_\theta$ the generator, and $\Delta$ the distance. This direct method cannot launch a co-attack, since it cannot share information across different instances.
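Both baselines are simple enough to sketch on a toy generator. The generator, the uniform seed distribution, the sample count, and the step sizes below are assumptions chosen for this example only.

```python
import random

def generate(z):
    # Toy stand-in for a trained generator (an assumption for this sketch).
    return [z, 2.0 * z]

def sq_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_neighbor_loss(x, n_samples=1000, seed=0):
    # Baseline 1: distance from x to its nearest neighbor among generated samples.
    rng = random.Random(seed)
    samples = [generate(rng.uniform(-2.0, 2.0)) for _ in range(n_samples)]
    return min(sq_l2(x, g) for g in samples)

def direct_projection_loss(x, steps=200, lr=0.05):
    # Baseline 2: gradient descent directly on the seed z, one z per target.
    z = 0.0
    for _ in range(steps):
        g = generate(z)
        grad = 2.0 * (g[0] - x[0]) + 4.0 * (g[1] - x[1])  # d loss / d z
        z -= lr * grad
    return sq_l2(x, generate(z))

member, non_member = [1.0, 2.0], [2.0, -1.0]
nn_member = nearest_neighbor_loss(member)
nn_non_member = nearest_neighbor_loss(non_member)
dp_member = direct_projection_loss(member)
dp_non_member = direct_projection_loss(non_member)
```

Note that Direct Projection optimizes a separate seed for each target, so nothing learned for one target is reused for another.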
We compute the AUC of our membership attack for different neural networks (WGANs, VAEs) and report the results in Table 1, which covers both single attacks and co-attacks (assuming that each tuple of target instances shares the same training or testing label). For single attacks, our attacker outperforms Nearest Neighbor and Direct Projection across datasets and models. Our attacker is extremely successful against VAEs. For co-attacks, our method is consistently successful. Note that for WGANs on Fashion-MNIST the co-attacker needs more iterations, because there is more information for the attacker to learn from; with more iterations, the co-attack AUC improves.
For iWGANs, the results have a similar trend. Since iWGANs generalize better than WGANs (as shown in the supplementary materials), it is harder to launch membership attacks.
Membership attack - CIFAR-10
For CIFAR-10, preliminary results show that the attack is less effective against GANs trained on larger datasets and quite successful when the model overfits a small training set.
4.1 Relationship between Membership Attack and Model Generalization
An important question is when a membership attack is successful. For classifiers (supervised learning), membership attacks were shown to be closely related to the generalization capability of the model [Yeom, Fredrikson, and Jha2017]. For generative models, there has not yet been an explicit discussion of how model generalization relates to membership attacks. Intuitively, we want a generalizing model to have similar training and testing reconstruction losses, so that the attacker is equally able to reproduce a training or a testing image. Note that this generalization condition is only a necessary condition for distribution learning. A generator that is able to produce every element of the data distribution (i.e., one that 'covers' all samples with non-zero measure) is not guaranteed to generate samples according to that distribution.
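Now, we make this intuition precise by measuring the generalization gap of a generative model $G_\theta$ as the difference between the expected attacker loss on non-training data (testing data not used by the training procedure) and on the training data $X_{\mathrm{train}}$ (the finite samples), both of which come from the same underlying data distribution $P_x$:
$$g(G_\theta) = \mathbb{E}_{x \sim P_x,\; x \notin X_{\mathrm{train}}}\Big[\min_\gamma \Delta\big(x, G_\theta(A_\gamma(x))\big)\Big] - \mathbb{E}_{x \in X_{\mathrm{train}}}\Big[\min_\gamma \Delta\big(x, G_\theta(A_\gamma(x))\big)\Big].$$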
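In practice the two expectations are estimated by sample means of the recorded attacker losses. A minimal sketch, with hypothetical loss values and the sign convention (testing minus training) taken as an assumption:

```python
from statistics import mean

def generalization_gap(train_losses, test_losses):
    # Mean attacker loss on held-out data minus mean attacker loss on training data.
    return mean(test_losses) - mean(train_losses)

# Hypothetical attacker losses for an overfitted model and a generalizing model.
overfit_gap = generalization_gap([0.1, 0.2, 0.3], [0.5, 0.7, 0.9])
generalizing_gap = generalization_gap([0.4, 0.5, 0.6], [0.4, 0.5, 0.7])
```

A large gap means training images are much easier to reproduce than unseen ones, which is exactly the situation in which the threshold-based membership test works well.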
In supervised learning, for a given hypothesis class of classifiers, the generalization gap decreases as the number of training data increases. With the proposed generalization gap for generative models, we find a similar pattern in Figure 2(a). For a WGAN trained on MNIST and stopped after a fixed number of training steps, the generalization gap diminishes once we use thousands of training data. We also plot the success rate of the membership attacks in Figure 2(a): it is strongly correlated with the generalization gap. The experimental details of this figure are the same as those described in Section 3.
On the other hand, the learning curve in Figure 2(b) depicts how the training and testing errors behave as the number of training steps increases. In deep learning, this curve provides an early stopping scheme to prevent overfitting. For generative models, the training error likewise decreases steadily, while the testing error increases again after some steps. This shows that generative models overfit (over-train) easily.
A desirable property of a generative model is the capability of generating versatile samples. This is termed diversity in the literature. The community, however, has not reached a consensus on how to evaluate diversity. In particular, an obvious failure mode to avoid is that the GAN memorizes the training data and simply reports the training samples. A number of methods have been proposed to test whether this is the case. For example, one test checks whether each generated image is similar to any image in the training data. Another test takes two random seeds $z_1$ and $z_2$ and checks whether the interpolation $\lambda z_1 + (1 - \lambda) z_2$ for $\lambda \in [0, 1]$ generates realistic outputs. However, neither test explicitly defines a measure of diversity. In a very recent work [Arora, Risteski, and Zhang2018], the authors proposed a birthday paradox test: if the GAN has simply memorized the $N$ training samples and randomly outputs one each time, then with roughly $\sqrt{N}$ samples in the output there is a good chance that two of them are the same. It is then suggested in [Arora, Risteski, and Zhang2018] to visually examine the images to identify duplicates.
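The arithmetic behind the birthday paradox test can be checked directly. The sketch below assumes a generator that outputs one of `support_size` memorized images uniformly at random, which is the idealized setting of the test; the sizes are arbitrary example values.

```python
import math

def collision_probability(support_size, batch_size):
    """Probability that a batch of uniform draws from the support contains a duplicate."""
    p_no_duplicate = 1.0
    for i in range(batch_size):
        p_no_duplicate *= 1.0 - i / support_size
    return 1.0 - p_no_duplicate

n = 10000                                  # hypothetical number of memorized images
batch = int(math.sqrt(n))                  # batch of sqrt(N) samples
p_sqrt = collision_probability(n, batch)
p_double = collision_probability(n, 2 * batch)
p_classic = collision_probability(365, 23)  # the textbook birthday setting
```

With these numbers a batch of 100 draws already contains a duplicate with probability of roughly 0.4, and doubling the batch pushes it well above 0.8, so duplicates in small batches are strong evidence of a small support.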
Here we use a geometric measure of diversity, called dispersion, which seeks a subset of images in the generated output that are far away from each other in terms of their features.
If the output images are concentrated around a small number of samples, the dispersion drops dramatically once the subset size goes beyond that number. We can observe such patterns in Figure 4. GANs trained with the smallest numbers of images are not able to generate diversified images, so their dispersion is significantly lower than that of the testing dataset.
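Following the verbal definition given earlier (the largest value of the closest pairwise distance among a subset of a given size), dispersion can be sketched as below. The symbol `k` for the subset size, the Euclidean distance on raw coordinates, and the greedy farthest-point heuristic are assumptions of this sketch; the exact version enumerates all subsets and is only feasible for tiny inputs.

```python
import itertools
import math

def min_pairwise_distance(points):
    return min(math.dist(a, b) for a, b in itertools.combinations(points, 2))

def dispersion_exact(points, k):
    # Largest achievable closest-pair distance over all k-subsets (exponential cost).
    return max(min_pairwise_distance(s) for s in itertools.combinations(points, k))

def dispersion_greedy(points, k):
    # Farthest-point heuristic: repeatedly add the point farthest from those already chosen.
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return min_pairwise_distance(chosen)

clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]  # two tight clusters
spread = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0), (2.5, 2.5)]

clustered_k2 = dispersion_exact(clustered, 2)
clustered_k3 = dispersion_exact(clustered, 3)  # k exceeds the number of clusters
spread_k3 = dispersion_exact(spread, 3)
greedy_k3 = dispersion_greedy(clustered, 3)
```

For the clustered points the value collapses as soon as `k` exceeds the number of clusters, while the spread-out points keep a large dispersion, mirroring the drop described above.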
Connection with Generalization.
When a model generalizes, the dispersion of the generated data is similar to the dispersion of the data distribution. In Section 4.1, the generalization gap becomes smaller as we use more training data, so the model can produce diversified data not seen in the training data. Data dispersion serves as evidence for such diversification. When a generative model is successful, it should interpolate or even extrapolate beyond the original training data, which results in a larger data dispersion.
In Figure 5, we visualize dispersion by embedding the generated images and the training data in a two-dimensional map using t-SNE [Maaten and Hinton2008]. The red dots are the training data, the number of which is the only variable that changes across the subfigures. The green dots show the generated images sampled from the GANs after training. The black dots represent samples from the real distribution. When there are only a small number of training images, the generated samples concentrate heavily around the training data, which produces a small dispersion. Once the number of training data is large enough, the dispersion of the generated data and that of the original data become similar.
|WGANs / VAEs||Adversary|
. ReLU, BN.
, stride=, 64. ReLU.
|FC . ReLU, BN.||Conv , stride=, 128. ReLU.||FC|
|Transpose Conv , stride, . ReLU, BN.||FC . ReLU.||FC|
|Transpose Conv , stride,. ReLU, BN.||FC Conv , channel=. Tanh|
4.3 Diversity versus Generalization
It is interesting to ask whether diversity and generalization capabilities are the same. In the literature, diversity has been one of the most desirable properties of generative models. And it is tempting to conclude that a model with good diversity has good generalization. In this section, we point out that these two measures (goals) are not always aligned. One could carefully choose training data so that diversity is enhanced yet generalization is hurt – i.e., membership attacks are more likely to be successful.
For example, we take a batch of data that has not been used in training and rank its instances in decreasing order of attacker loss (Eqn 1) with respect to the current generator. We add the highest-ranked instance (the one that is least reproducible) to the adversarial sample set. Then we update the current GAN by training on all data in the current batch and repeat the above procedure. Eventually, we obtain a subset of instances that were the hardest to reproduce during the process.
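After the sampling algorithm has completed, a subset $S_{\mathrm{adv}}$ of data points has been collected from the real distribution. We then train a GAN from scratch using the data in $S_{\mathrm{adv}}$. For comparison, we also train a separate GAN using a set $S_{\mathrm{rand}}$ with $|S_{\mathrm{rand}}| = |S_{\mathrm{adv}}|$, randomly selected from the same distribution. We refer to $S_{\mathrm{adv}}$ as adversarial sampling and to $S_{\mathrm{rand}}$ as random sampling.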
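The sampling procedure can be sketched as a short loop. The callables `attacker_loss` and `train_on_batch`, and the toy "model" that merely tracks a running mean of the data it has seen, are hypothetical stand-ins introduced for illustration; in the actual procedure they would be the membership attacker of Eqn 1 and a GAN training step.

```python
def adversarial_sampling(batches, attacker_loss, train_on_batch, model):
    """From each fresh batch, keep the instance the current model reproduces worst."""
    selected = []
    for batch in batches:
        # Rank the unseen batch by attacker loss under the current model.
        hardest = max(batch, key=lambda x: attacker_loss(model, x))
        selected.append(hardest)
        # Then continue training the current model on the whole batch.
        model = train_on_batch(model, batch)
    return selected

# Toy stand-ins: the model is (sum, count) of seen scalars, and the
# "attacker loss" is the distance from x to the running mean.
def toy_attacker_loss(model, x):
    total, count = model
    center = total / count if count else 0.0
    return abs(x - center)

def toy_train_on_batch(model, batch):
    total, count = model
    return (total + sum(batch), count + len(batch))

batches = [[1.0, 2.0, 10.0], [3.0, 4.0, -6.0]]
adversarial_set = adversarial_sampling(batches, toy_attacker_loss, toy_train_on_batch, (0.0, 0))
```

The selected set is biased toward outliers by construction, which is consistent with the observation that adversarial sampling emphasizes corner cases.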
Hereafter, we compare the dispersion and membership attack results of GANs trained with adversarial sampling and random sampling. The dispersion of the data generated under adversarial sampling is even higher than the dispersion of the original training data (Figure 5(a)). But in terms of membership attacks, GANs trained on adversarial samples fare much worse (Figure 5(b)). One way to understand this result is that the adversarial samples pay too much attention to extreme cases, while random sampling gives a better representation of the real distribution.
4.4 Network architecture
4.5 Experimental Results on VAEs
The results for VAEs are provided in the supplementary materials. It is commonly observed in practice that VAEs generate blurry images [Zhao, Song, and Ermon2017]. This phenomenon is also captured by the low dispersion of the generated images. We have found that even when VAEs are trained with thousands of MNIST images, they cannot reach the same dispersion as their GAN counterparts. There is a gap between the dispersion of the original data distribution and that of the synthetic distribution produced by VAEs.
Similarly, when more training data is used, VAEs are more resilient to membership attacks. Furthermore, the generalization gap of VAEs is generally higher than that of the corresponding GANs trained on the same number of images. This is consistent with the results in Table 1 and indicates that attacks on VAEs are easier.
4.6 Experimental Results on iWGANs
The attacker AUC and generalization gap for iWGANs with different numbers of training data are presented in Figure 8. Generally speaking, iWGANs have a smaller generalization gap (higher resilience to membership attacks) than WGANs when the same number of training data is used.
Privacy of training data is an important issue in learning, so how to protect data privacy is a natural next step. The work in this paper is largely empirical, like many papers on generative models. We hope this work encourages more work on this important topic.
- [Abadi and Andersen2016] Abadi, M., and Andersen, D. G. 2016. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918.
- [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
- [Arora, Risteski, and Zhang2018] Arora, S.; Risteski, A.; and Zhang, Y. 2018. Do GANs learn the distribution? some theory and empirics. In Proceedings of the Sixth International Conference on Learning Representations.
- [Fredrikson et al.2014] Fredrikson, M.; Lantz, E.; Jha, S.; Lin, S.; Page, D.; and Ristenpart, T. 2014. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In USENIX Security Symposium.
- [Fredrikson, Jha, and Ristenpart2015] Fredrikson, M.; Jha, S.; and Ristenpart, T. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security.
- [Goodfellow et al.2014a] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014a. Generative adversarial nets. In NIPS, 2672–2680.
- [Goodfellow et al.2014b] Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014b. Generative adversarial networks.
- [Gulrajani et al.2017] Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein GANs. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5767–5777.
- [Hayes et al.2017] Hayes, J.; Melis, L.; Danezis, G.; and De Cristofaro, E. 2017. LOGAN: Evaluating privacy leakage of generative models using generative adversarial networks. arXiv preprint arXiv:1705.07663.
- [Hitaj, Ateniese, and Perez-Cruz2017] Hitaj, B.; Ateniese, G.; and Perez-Cruz, F. 2017. Deep models under the GAN: Information leakage from collaborative deep learning. CCS.
- [Kingma and Welling2013a] Kingma, D. P., and Welling, M. 2013a. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Kingma and Welling2013b] Kingma, D. P., and Welling, M. 2013b. Auto-Encoding variational bayes. ICLR 2014.
- [Krizhevsky2009] Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report.
- [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov):2579–2605.
- [McMahan and Ramage2017] McMahan, B., and Ramage, D. 2017. Federated learning: Collaborative machine learning without centralized training data. Technical report, Google.
- [Samangouei, Kabkab, and Chellappa2018] Samangouei, P.; Kabkab, M.; and Chellappa, R. 2018. Defense-GAN: protecting classifiers against adversarial attacks using generative models. In Proceedings of the Sixth International Conference on Learning Representations (ICLR).
- [Shannon1949] Shannon, C. E. 1949. Communication theory of secrecy systems. Bell Labs Technical Journal 28(4):656–715.
- [Shokri et al.2016] Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2016. Membership inference attacks against machine learning models.
- [Shokri et al.2017] Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017. Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, 3–18. IEEE.
- [Xiao, Rasul, and Vollgraf2017] Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
- [Yeom, Fredrikson, and Jha2017] Yeom, S.; Fredrikson, M.; and Jha, S. 2017. The unintended consequences of overfitting: Training data inference attacks.
- [Zhao, Song, and Ermon2017] Zhao, S.; Song, J.; and Ermon, S. 2017. Towards deeper understanding of variational autoencoding models.