Recent developments in adversarial machine learning DBLP:journals/corr/abs-1802-00420 ; DBLP:conf/nips/SongSKE18 ; zhao2018generating have cast serious doubts on the robustness of deep learning models. Although many defense mechanisms DBLP:journals/corr/abs-1903-06603 ; sinha2018certifiable ; DBLP:journals/corr/abs-1801-09344 ; DBLP:journals/corr/MadryMSTV17 ; DBLP:journals/corr/abs-1711-00851 have been proposed to alleviate the security risks faced by these models, very few are resilient to attacks DBLP:journals/corr/abs-1802-00420 ; DBLP:journals/corr/CarliniW16a . Adversarial attacks manipulate data with imperceptible noise so as to cause a classifier to make incorrect predictions 42503 . In the image domain, adversarial examples are generated to look identical to real images, although the precise locations of fine details in the images might not be preserved. In the text domain, even slight changes to a sentence can alter its readability or corrupt its meaning zhao2018generating . For adversarial samples to be coherent, they need to be valid and should ideally convey the meaning of the true inputs. In the text domain especially, they need to be grammatically and linguistically sound. We posit that existing approaches fail to enforce such requirements because the semantics of the true inputs and of the adversarial examples, captured in their respective low-dimensional dense representations or manifolds, do not necessarily align. This may result from a poor characterization of the manifold of the true inputs due to rigid assumptions about its structure – such as confining the latent variable model of the manifold to be Gaussian DBLP:conf/nips/SongSKE18 ; zhao2018generating – or because the search for adversarial examples is restricted to uniformly-bounded regions or conducted along suboptimal gradient directions 42503 ; DBLP:journals/corr/KurakinGB16 ; ianj.goodfellow2014 .
In this study, we explore a novel approach to generate coherent adversarial examples in the image and text domains. Our approach is built around the manifold invariance theorem 10.1088/2053-2571/ab0281ch6 and seeks to align the manifolds of the adversarial examples and the true inputs so that the adversarial examples may reflect the semantics of the true inputs. To introduce our approach intuitively, we decompose the adversarial learning problem into: (i.) manifold learning, where we develop a variational inference technique to encode high-dimensional data into a low-dimensional dense representation without needing to reparameterize the encoder network kingma2014 , and (ii.) perturbation invariance under the manifold invariance concept, where we describe a simple yet efficient perturbation strategy that ensures the cases we perturb remain in the manifold of the true inputs. We subsequently illustrate how the rich latent structure exposed in the manifold can be leveraged as a source of knowledge upon which adversarial constraints can be imposed, all while learning the manifold and without resorting to exhaustive search as per common practice DBLP:conf/nips/SongSKE18 ; zhao2018generating ; ianj.goodfellow2014 . Finally, we apply our approach to images and text data in a black-box setting to generate adversarial examples that are coherent and perceptually similar to the true inputs.
The main contributions of our work are thus: (i.) a novel variational inference method for manifold learning in the presence of continuous latent variables with minimal assumptions about their distribution, (ii.) an intuitive perturbation strategy that encourages perturbed elements of a manifold to remain within the manifold, (iii.) an end-to-end pipeline that combines (i.) and (ii.) to generate adversarial examples that follow the same distribution as the inputs, and (iv.) illustration on images and text, as well as empirical validation against strong certified and non-certified adversarial defenses.
2 Problem Setup & Architecture
Problem Setup. Consider a dataset of training examples, their corresponding classes, and a black-box classifier. We want to generate adversarial examples that are coherent. By “coherent”, we mean given an example , we want its adversarial example to come from a distribution similar to of . In particular, we require to be the nearest such instance to in the manifold that defines and subject to adversariality. To produce coherent adversarial examples, we devise our approach around the invariant manifold concept defined in 10.1088/2053-2571/ab0281ch6 which we revisit below.
Manifold Invariance 10.1088/2053-2571/ab0281ch6 . Let be a metric space and a homeomorphism. For any point , a neighborhood of is invariant under if: .
Definition 2.1 stipulates that the topology of is preserved under as any projection of remains in . To see how Definition 2.1 can be enforced in an adversarial setting, we confine for now to an -radius ball around and let be the class of . If there exists a point s.t and , then is adversarial to athalye2017 and is neighborhood-preserving.
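As a minimal sketch of the neighborhood-preserving adversarial condition described above, the check below tests whether a candidate point stays inside the ball around the input while changing the predicted class. The function names, the toy classifier, and the default choice of the infinity norm are assumptions made for illustration; the text does not fix a particular norm here.

```python
import numpy as np

def is_adversarial_in_ball(x, x_adv, classify, eps, norm_ord=np.inf):
    """Return True if x_adv lies in the eps-ball around x and changes the label.

    `classify` is any black-box function mapping an input to a class label.
    `norm_ord` (an assumption) selects the norm used to measure the distance.
    """
    # Distance between the flattened input and its candidate adversary.
    dist = np.linalg.norm((x_adv - x).ravel(), ord=norm_ord)
    return bool(dist <= eps and classify(x_adv) != classify(x))

def toy_classifier(x):
    """Hypothetical stand-in for a black-box classifier: thresholds the mean."""
    return int(x.mean() > 0.5)
```

A candidate that flips the label but lies outside the ball, or that stays inside the ball without flipping the label, is rejected by this check.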
In this paper, we mainly operate in the low embedding space of we denote . Earlier, we restricted the search region for the adversarial examples of to the uniformly-bounded ball . In reality, the search region may not be as well-defined and may even be difficult to characterize especially in the dense regions of . Thus, to adapt Definition 2.1 to , we consider instead two embedding maps and parameterized respectively by and , and use to find points that lie in the vicinity of in such that when mapped back onto , the input space of , they induce adversariality. We assume here that and follow the implicit distributions and . (Footnote 1: Treatment of and as random variables as in kingma2014 allows for model averaging yuchen2017 , which we leverage.)
Adversarial Learning Objective. Let be metric spaces and a mapping function. We want to learn , , and s.t the divergence is arbitrarily small, and for any , is also arbitrarily small and .
Model Architecture. To attain our adversarial learning objective, we propose as a framework the architecture illustrated in Figure 2 that we design according to Definition 2.2. Our framework is essentially a variational auto-encoder with two encoders that learn to approximate the variational posterior over instead of its reparameterized form kingma2014 . To succeed in this task, we introduce two inference mechanisms – implicit manifold learning via Stein variational inference and Gram-Schmidt basis sign method – to draw instances of model parameters from the implicit distributions and that we parameterize the two encoders with. Both encoders optimize the uncertainty inherent to embedding while guaranteeing easy sampling via Bayesian ensembling. The decoder acts as a generative model for crafting adversarial examples and a proxy for creating latent targets in the space in order to optimize the first encoder; a process we refer to as inversion and depict in Figure 2.
Unlike most perturbation-based approaches athalye2017 ; Shaham2018UnderstandingAT that search for adversarial examples in the input space of , we leverage the dense representation of the latent space of ; the intuition being here that the latent space captures well the semantics of . Thus, rather than finding the adversarial example of a given input in , we learn to perturb its latent code in a way that the perturbed version and lie in the same manifold. Then, we construct using our decoder . By learning to efficiently perturb the latent codes, and then mapping these low-dimensional representations back onto the input space, we control the perturbations we inject to the adversarial examples thereby ensuring that they are perceptually similar to the inputs. In the following, we refer to the manifold of as the set of locally Euclidean latent representations (or points) 10.1088/2053-2571/ab0281ch6 of instances of .
3 Technical Background
Manifold Learning. To uncover structure in some high dimensional data and understand its meta-properties, it is common practice to map to a low dimensional subspace where explanatory hidden features may become apparent. Manifold learning is based on the assumption that the data of interest lies on or near lower dimensional manifolds in its embedding space. In the variational auto-encoder (VAE) kingma2014 setting upon which we build our work, the datapoints are modeled via a decoder with a prior placed on the latent codes . To learn the parameters , one typically maximizes a variational approximation to the empirical expected log-likelihood, , called evidence lower bound (ELBO) and defined by:
To evaluate the objective , we require the ability to sample efficiently from , the encoder distribution which approximates the posterior inference . Specifically, we need a closed and differentiable form for in order to evaluate the expectation. The reparameterization trick kingma2014 provides a simple way to rewrite the expectation such that the Monte Carlo estimate of is differentiable w.r.t. . More formally, under some mild differentiability conditions kingma2014 , the random variable can be reparameterized using a differentiable transformation of an auxiliary noise variable : where the prior is generally confined to a family of simple and tractable distributions like a Gaussian distribution.
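To make the reparameterization trick concrete, the following sketch shows the usual Gaussian case from kingma2014: a latent code is written as a deterministic, differentiable function of the encoder outputs and an auxiliary standard normal noise variable, and the KL term of the ELBO against a standard normal prior then has a closed form. The function names and the diagonal-Gaussian parameterization through a mean and a log-variance are assumptions for illustration, not the notation of this paper.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    All randomness lives in eps, so z is a differentiable function
    of the encoder outputs mu and log_var.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```

This is the construction that ties the encoder to a simple tractable family, which is the restriction the method in section 4.1 seeks to avoid.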
Stein Variational Gradient Descent (SVGD) qiangliu2016 is a nonparametric variational inference method that combines the advantages of MCMC sampling and variational inference. Unlike variational Bayes in VAEs kingma2014 , SVGD does not confine the target distributions it approximates to simple or tractable parametric distributions, while still remaining efficient. To approximate the target distribution , SVGD maintains particles , initially sampled from a simple distribution, which it iteratively transports via functional gradient descent. Henceforward, we shall consider these particles as instances of model parameters. At iteration , each particle is updated as follows:
where is a step-size and is a positive-definite kernel. In eq. (2), each particle determines its update direction by consulting the other particles and asking for their gradients. The importance of these particles is weighted according to the distance measure . Closer particles are given higher consideration than those lying further away. The term is a regularizer that acts as a repulsive force between the particles to prevent them from collapsing into a single particle. The resulting particles can thus be used to approximate the predictive posterior distribution over :
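The particle update described above can be sketched as follows. This is a minimal illustration of the standard SVGD step from qiangliu2016 with an RBF kernel, not the exact update used in this paper: the fixed bandwidth `h`, the function names, and the array layout (one particle per row) are assumptions.

```python
import numpy as np

def rbf_kernel(theta, h):
    """RBF kernel matrix and its gradient for particles theta of shape (M, D).

    K[j, i] = exp(-||theta_j - theta_i||^2 / h)
    grad_K[j, i] = gradient of K[j, i] with respect to theta_j
    """
    diff = theta[:, None, :] - theta[None, :, :]      # diff[j, i] = theta_j - theta_i
    sq_dist = np.sum(diff**2, axis=-1)
    K = np.exp(-sq_dist / h)
    grad_K = -2.0 / h * diff * K[:, :, None]
    return K, grad_K

def svgd_step(theta, grad_log_p, step_size=0.1, h=1.0):
    """One SVGD update: kernel-weighted gradients plus a repulsive term."""
    M = theta.shape[0]
    K, grad_K = rbf_kernel(theta, h)
    scores = grad_log_p(theta)                        # (M, D) gradients of log target
    attract = K.T @ scores                            # sum_j K[j, i] * score_j
    repulse = grad_K.sum(axis=0)                      # sum_j grad_{theta_j} K[j, i]
    return theta + step_size * (attract + repulse) / M
```

The `attract` term is where each particle consults the gradients of the others, weighted by kernel similarity; the `repulse` term pushes nearby particles apart and keeps them from collapsing.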
4 Proposed Method
We first describe a novel variational inference method to learn the manifold of . Then, we introduce a technique to perturb the elements of while ensuring is invariant to the perturbations. Finally, we describe our end-to-end process for generating manifold preserving adversarial examples.
4.1 Implicit Manifold Learning
As described in section 3, it is customary to optimize the ELBO in eq. (1) when training VAEs due to the intractability of the data-likelihood. Moreover, we need a closed and differentiable form for the encoder . This is achieved by reparameterizing the encoder using a differentiable transformation of some auxiliary Gaussian noise variable. Constraining the noise to be Gaussian is however too restrictive 2015arXiv150505770J and may lead to learning poorly the manifold of the data DBLP:journals/corr/ZhaoSE17b . To alleviate this issue, one can minimize the divergence as in yuchen2017 instead of optimizing the ELBO explicitly. In yuchen2017 , for an input , the authors draw latent codes using a recognition network and update them via SVGD. Similar to yuchen2017 , we focus on the term and use a recognition network to sample model instances we denote particles. Unlike yuchen2017 , however, we generate the latent codes using instead of applying dropout noise on the recognition network.
We aim to leverage to map to some manifold in the latent space . This encoding inherently induces uncertainty we ought to capture in order to learn efficiently. Bayesian methods provide a principled way to model uncertainty through the posterior distribution over model parameters. Thus, we let every particle define the weights and biases of a Bayesian neural network. For large, maintaining the particles can be computationally prohibitive because of the memory footprint. Furthermore, the need to generate the particles during inference for each test example is undesirable. To remedy this, we train a recognition network parameterized by that takes as input some noise and outputs . The parameters are updated through a small number of gradient steps in order to produce good generalization performance. If we let be the parameters of at iteration , and given which we defined in eq. (2), we get by:
In the following, we shall use SVGD) to denote an SVGD update of using . To apply SVGD, we need to evaluate which no longer depends on the prior but on the posterior (eq. 3) as is observed. Computing this posterior requires having a target for . Hence, we use our decoder to create a target for by first sampling , then reconstructing , and then re-sampling the target . We summarize this procedure in Figure 2.
As in yuchen2017 , plays a role analogous to in eq. (1). It gives us means to sample latent codes from a particle given an input without imposing an explicit and functional parametric form for . Unlike yuchen2017 , we do not apply SVGD on the latent codes but on the particles that generate the codes. By also being Bayesian, every particle provides some prior information on how to sample these codes. During training and inference, we use Bayesian model averaging to get a latent code from ; formally, if , for instance, then where the posterior is given in eq. (3).
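The model-averaging step above can be illustrated with a short sketch. It assumes, purely for illustration, that each particle is the weight matrix and bias of a one-layer encoder and that the particles are weighted uniformly, as is usual for an SVGD particle approximation of a posterior; the helper names are hypothetical.

```python
import numpy as np

def encode_with_particle(x, particle):
    """Hypothetical one-layer encoder whose parameters are one particle (W, b)."""
    W, b = particle
    return np.tanh(x @ W + b)

def averaged_latent_code(x, particles, weights=None):
    """Average the latent codes produced by every particle for the inputs x."""
    codes = np.stack([encode_with_particle(x, p) for p in particles])  # (M, N, d)
    if weights is None:
        # Assumption: uniform weights over the M particles.
        weights = np.full(len(particles), 1.0 / len(particles))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, codes, axes=1)                         # (N, d)
```

Because each particle is a full set of encoder parameters, new latent codes can be drawn for any input without storing per-input codes.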
4.2 Perturbation Invariance
Previously, we described a method to encode in its manifold . Here, we focus on learning to perturb the elements of . We require the perturbed elements to reside in so that they exhibit the features of . Intuitively, we seek a linear mapping such that if : 10.1088/2053-2571/ab0281ch6 . Linear spans, for instance, satisfy this condition. We say then that is -invariant or preserved under . Rather than finding directly, we introduce a new set of model instances and let every denote the weights and biases of a Bayesian neural network. Then, if and is its latent code with in eq.(3), we can set
We leverage the local smoothness of to learn each in a way to encourage to reside in in a close neighborhood of using a novel technique called Gram-Schmidt Basis Sign Method.
Gram-Schmidt Basis Sign Method (GBSM). Let be a minibatch of samples of . For , let be a set of latent codes where and . We aim to learn in a way to generate perturbed versions of along the directions of an orthonormal basis . Given that is locally Euclidean, we can compute the dimensions of its subspace by applying Gram-Schmidt to orthogonalize the span of representative local points. We formalize this objective as an optimization problem where we jointly learn the perturbations , the directions along which we should perturb , and . Formally, we minimize the objective :
To describe the minimization process in eq. (5) in more detail, we first sample a model instance given . Then, we generate the codes for all . We orthogonalize and find the noise tensor that minimizes along the directions of the basis vectors . The goal is to get the values of to be as small as possible. With fixed, we update by minimizing again.
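The orthogonalization step can be sketched as below. The code only shows how a set of latent codes is turned into an orthonormal basis with Gram-Schmidt and how a code is then shifted along the basis directions; it does not implement the GBSM objective in eq. (5). The function names and the tolerance for dropping linearly dependent codes are assumptions.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize the rows of `vectors`, dropping near-dependent ones."""
    basis = []
    for v in vectors:
        w = np.asarray(v, dtype=float).copy()
        for u in basis:
            w -= np.dot(w, u) * u          # remove the component along u
        norm = np.linalg.norm(w)
        if norm > tol:                     # keep only directions that add rank
            basis.append(w / norm)
    return np.array(basis)

def perturb_along_basis(z, basis, coeffs):
    """Shift the latent code z by sum_k coeffs[k] * basis[k]."""
    return z + np.asarray(coeffs, dtype=float) @ basis
```

Since the basis is orthonormal, the Euclidean size of the perturbation equals the norm of `coeffs`, so keeping the coefficients small keeps the perturbed code close to the original one.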
Maintaining for a large can however be computationally prohibitive. Thus, as in section 4.1, we use a recognition network that takes as input some noise and generates . Here also, using eq. (5), we update through a small number of gradient steps.
In the following, we use the notation GBSM where to update at once.
Distribution Alignment. Although GBSM confers us clear benefits in the forms of latent noise imperceptibility and sampling speed, may deviate from in which case the manifolds they learn may differ. This understandably contradicts the requirement in Definition 2.2 that the divergence between the implicit distributions and , hence between their samples and , needs to be small; the intuition being to get the adversarial examples to reflect the semantics of the inputs by aligning their manifolds. To remedy this, we further regularize each after every GBSM update. In essence, we slightly modify SVGD to ensure the model instances follow the transform maps constructed by junhan2017 at each iteration using the operator :
Unlike in eq. (2), with the parameters determine their own update direction by consulting the particles alone instead of consulting each other. In the following, we shall use SVGD to denote the application of the gradient update rule in eq. (7) using the kernelized operator .
4.3 Manifold Preserving Adversarial Learning
In general, one can produce adversarial examples via white-box ianj.goodfellow2014 or black-box Papernot:2017:PBA:3052973.3053009 attacks. In the more standard black-box case, which is the main focus of this study, we only have access to the predictions of a classifier and need to generate adversarial examples without knowing the intricacies of or having access to its loss function. In this setting, adversarial examples are produced by maximizing an auxiliary loss, here the log-likelihood, of a target class over an -radius ball around the input athalye2017 . This is formalized as:
Previously, we decomposed the adversarial learning problem into: (i.) manifold learning where we develop a new method to encode high-dimensional data into a low dense representation that faithfully characterizes its topological structure, and (ii.) manifold perturbation where we ensure cases we perturb remain in the manifold of the true inputs. Here, we show how both methods can be unified in one learning procedure to generate adversarial examples without resorting to exhaustive search.
We illustrate this end-to-end procedure in Algorithm 1 and describe it here. First, we sample , the particles that learn to model the distribution of in its manifold , from . Then, we sample , a set of transformations from we want to be invariant to under the perturbations these transformations induce. Both and act as encoders of but more faithfully than . Both also attempt to capture the complex uncertainty structure of by learning to approximate its predictive posterior. To optimize , we have to compute this posterior. We do so after the first inner update by using the inversion mechanism we described in Figure 2 in an iterative fashion. From updating , we optimize using stochastic gradient descent. Then, we apply the objective in eq. (7) to optimize , and subsequently update . This summarizes the inner-training of the recognition networks and . To generate adversarial examples, we extend the objective in eq. (8) as follows:
where is the reconstruction error of that we formalize in section 6. The second term is the cost incurred when failing to fool the classifier and is the strength of the adversarial examples.
5 Related Work
Manifold Learning. VAEs are generally used to learn manifolds DBLP:journals/corr/abs-1808-06088 ; 2018arXiv180704689F ; higgins2016 by maximizing the ELBO of the data log-likelihood DBLP:journals/corr/abs-1711-00464 ; 2017arXiv170901846C . Optimizing the ELBO requires reparameterizing the encoder to a simple distribution like a Gaussian kingma2014 . The Gaussian prior assumption is however too restrictive 2015arXiv150505770J and may lead to a poorly learned manifold of the data DBLP:journals/corr/ZhaoSE17b . To alleviate this issue, one can instead optimize the divergence between the encoder and the posterior inference as in yuchen2017 . Although our work and yuchen2017 are similar in that both use recognition networks, our approach is more general. In our case, generates the particles, which are model instances, from which we can sample infinitely many latent codes rather than finite pointwise estimates yuchen2017 . Also, given that our particles are Bayesian, we learn to better capture the uncertainty inherent to encoding data using VAEs.
Adversarial Examples. Studies in adversarial deep learning DBLP:journals/corr/abs-1802-00420 ; DBLP:journals/corr/KurakinGB16 ; ianj.goodfellow2014 ; athalye2017 can be categorized into two groups. The first group DBLP:journals/corr/CarliniW16a ; athalye2017 ; DBLP:journals/corr/Moosavi-Dezfooli16 proposes to generate adversarial examples directly in the input space by distorting, occluding or changing illumination in images to cause changes in classification. The second group DBLP:conf/nips/SongSKE18 ; zhao2018generating , where our work belongs, uses generative models to search for adversarial examples in the dense and continuous representation of the data rather than in its input space.
Adversarial Images. Song et al. DBLP:conf/nips/SongSKE18 propose to construct unrestricted adversarial examples in the image domain by training a conditional GAN that constrains the search region for a latent code in the neighborhood of a target . Zhao et al. zhao2018generating also use a GAN to map input images to a latent space where they conduct their search for adversarial examples. Both studies are closely related to ours. Unlike DBLP:conf/nips/SongSKE18 , however, our adversarial examples are not restricted. Contrary to DBLP:conf/nips/SongSKE18 and zhao2018generating , we do not need to set the noise level to inject into the latent codes of the true inputs, as the perturbations are automatically learned. Moreover, in contrast to DBLP:conf/nips/SongSKE18 and zhao2018generating , where the search for adversarial examples is exhaustive and decoupled from the training of the GANs, our approach is end-to-end. Lastly, by capturing the uncertainty induced by the mapping of the data to latent space, we learn to characterize the semantics of the data better, which in turn allows us to generate more realistic adversarial examples.
Previous studies on adversarial text generation pmlr-v80-zhao18b ; DBLP:journals/corr/JiaL17 ; DBLP:journals/corr/Alvarez-MelisJ17 ; DBLP:journals/corr/LiMJ16a perform word erasures and replacements directly in the input space using domain-specific rules or heuristics or require manual curation. Similar to us, Zhao et al. zhao2018generating propose to search for textual adversarial examples in the latent representation of the data. However, in addition to the differences mentioned above, the search for adversarial examples is handled more gracefully in our case thanks to an efficient gradient-based optimization method in lieu of a computationally expensive search in the latent space.
6 Experiments & Results
We evaluate our approach on a number of black-box classification tasks of images and text, and illustrate in Appendix A with synthetic data that our adversarial examples are manifold-preserving.
Image Datasets. For image classification, we experiment with three standard datasets: MNIST LeCun:1989:BAH:1351079.1351090 , CelebA liu2015faceattributes and SVHN 37648 . For MNIST and SVHN, each image of a digit represents the class the image belongs to. For CelebA, we group the face images according to their gender (female, male), similar to DBLP:conf/nips/SongSKE18 , and focus on gender classification.
Text Datasets. For text classification, we consider the SNLI DBLP:journals/corr/BowmanAPM15 dataset. SNLI consists of sentence pairs where each pair contains a premise and a hypothesis, and a label indicating the relationship (entailment, neutral, contradiction) between the premise and hypothesis. For instance, the following pair is assigned the label entailment to indicate that the premise entails the hypothesis.
Premise: A soccer game with multiple males playing. Hypothesis: Some men are playing a sport.
We embed the image and text inputs using a convolutional neural network (CNN). To generate adversarial images, we design the decoder as (i.) a transpose CNN, and (ii.) a pre-trained GAN pmlr-v70-odena17a . For adversarial text generation, we consider the following designs for : (i.) a transpose CNN, (ii.) a language model, and (iii.) the decoder of a pre-trained ARAE pmlr-v80-zhao18b model. The recognition networks and are fully connected neural networks of similar architecture. To evaluate our approach, we target a range of adversarial models. For more details on the architectures of these models, feature extraction, , , and the decoder , we refer the reader to Appendix B.
6.1 Generating Adversarial Images
Setup. As argued in DBLP:journals/corr/abs-1802-00420 , the strongest non-certified defense available against adversarial attacks is adversarial training with Projected Gradient Descent (PGD) DBLP:journals/corr/MadryMSTV17 . Thus, we evaluate the strength of our MNIST, CelebA and SVHN adversarial examples against adversarially trained ResNet DBLP:journals/corr/HeZRS15 models with a 40-step PGD and noise margin less than 0.3. The ResNet models follow the architecture design of DBLP:conf/nips/SongSKE18 . For MNIST, we also target the certified defenses DBLP:journals/corr/abs-1801-09344 and DBLP:journals/corr/abs-1711-00851 with set to 0.1.
We formalize in eq. (9) respectively as two -norm losses and a discriminative loss. For the p-norm case, we first reconstruct using the true inputs as targets, i.e., where and are samples we generate. For the latter, we use a pre-trained Wasserstein formulation of AC-GAN pmlr-v70-odena17a to generate samples and set . Finally, we use the discriminator of the same GAN to discriminate between and . We experiment with all three variants and report only the results for where the adversarial success rates and sample quality are higher.
Adversarial Success Rate (ASR). Against the non-certified defenses, we achieve an ASR of 97.2% for MNIST, 87.6% for SVHN, and 84.4% for CelebA. The certified defenses on MNIST guarantee that no attack with less than 0.1 can have a success rate larger than 35% and 5.8% respectively DBLP:journals/corr/abs-1801-09344 ; DBLP:journals/corr/abs-1711-00851 . However, our ASR against these defenses is 95.2% and 96.8%. For all the datasets, the latent noise levels are smaller than , and the accuracies of the target models are higher than 96.3%; thus proving the effectiveness of our method in generating strong adversarial examples.
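For concreteness, an adversarial success rate of the kind reported above could be computed as in the sketch below. It assumes an untargeted attack and counts successes only over inputs that the target model classifies correctly before perturbation; this convention and the function name are assumptions, since the text does not spell out the exact protocol.

```python
import numpy as np

def adversarial_success_rate(y_true, y_clean_pred, y_adv_pred):
    """Fraction of correctly classified inputs whose adversary is misclassified."""
    y_true, y_clean_pred, y_adv_pred = map(
        np.asarray, (y_true, y_clean_pred, y_adv_pred)
    )
    # Assumption: only inputs the model gets right on clean data are counted.
    correct = y_clean_pred == y_true
    if not correct.any():
        return 0.0
    fooled = y_adv_pred[correct] != y_true[correct]
    return float(fooled.mean())
```

Under this convention, an input that is already misclassified before perturbation does not count toward the rate.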
Noise Level. We compute the latent strength of the adversarial examples by evaluating the average spectral norm of the learned perturbations over a few minibatch samples. For MNIST, CelebA, and SVHN, the latent nuisances are respectively , and , all smaller than , and significantly lower than in DBLP:conf/nips/SongSKE18 . (Footnote 2: Note that our results are not directly comparable with Song et al. DBLP:conf/nips/SongSKE18 , as their reported success rates are for targeted unrestricted adversarial examples manually computed from Amazon MTurkers' votes, unlike ours.) Beyond the imperceptibility of our adversarial examples, these noise levels show that the distributions the particles and the model instances follow are similar. This is further illustrated in Figure 7, with the KDE plots, by the overlapping marginal distributions of the clean and perturbed latent codes generated respectively from using and ; thus resulting in good sample quality as exemplified in Figure 14.
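The noise-level measurement described above can be sketched as follows, assuming each minibatch of learned perturbations is stored as a two-dimensional array (for example, one row per latent code). The function name and the array layout are assumptions for illustration.

```python
import numpy as np

def average_spectral_norm(perturbations):
    """Mean spectral norm (largest singular value) over perturbation matrices."""
    norms = [
        np.linalg.norm(np.asarray(p, dtype=float), ord=2)  # ord=2: spectral norm
        for p in perturbations
    ]
    return float(np.mean(norms))
```

For a two-dimensional array, `np.linalg.norm` with `ord=2` returns the largest singular value, which is the spectral norm.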
Table 1: Examples of generated adversarial hypotheses (P: premise, H: hypothesis).
| Input | Sentences and label |
| --- | --- |
| True Input 1 | P: A biker races. H: A person is riding a bike. Label: Entailment |
| Adversary 1 | H: A man races. Label: Contradiction |
| True Input 2 | P: The girls walk down the street. H: Girls walk down the street. Label: Entailment |
| Adversary 2 | H: A choir walks down the street. Label: Neutral |
| True Input 3 | P: Two dogs playing fetch. H: Two puppies play with a red ball. Label: Neutral |
| Adversary 3 | H: Two people play in the snow. Label: Contradiction |
6.2 Generating Adversarial Text
Setup. We perturb the hypothesis sentences to attack our SNLI classifier while keeping the premise sentences unchanged. Similar to zhao2018generating , we use ARAE pmlr-v80-zhao18b for word embedding, and a CNN for sentence embedding. To generate adversarial sentences from the perturbed latent codes, we experiment with three decoders: (i.) is a transpose CNN, (ii.) is a language model, and (iii.) we use the decoder of a pre-trained ARAE pmlr-v80-zhao18b model. In all three cases, we use the cross-entropy loss as the reconstruction loss (see eq. 9) and condition the generation of the adversarial hypotheses on the premises of the sentence pairs. We detail the configuration design of each decoder in Appendix B. We generate adversarial text at the word level using a vocabulary of only 11,000 words, similar to zhao2018generating .
Adversarial Success Rate (ASR). Using the transpose CNN, we achieve an ASR of 67.28% against the SNLI classifier that has an accuracy of 89.42%. Table 1 shows some examples of the generated adversarial hypotheses. The adversarial examples look legible and informative, and convey generally the meaning of the perturbed sentences. Sometimes, however, we notice slight changes in the meaning of the true hypotheses. We do not report the results we achieve with the other two designs of the decoder as the generated adversarial examples are mostly illegible. We posit that this is due to the compounding effect of the perturbations on the language model and a distribution shift with ARAE. We discuss these limitations and provide more examples of adversarial text samples in Appendix A.
Many adversarial learning approaches fail to enforce the semantic relatedness that ought to exist between true inputs and their adversarial counterparts. Motivated by this fact, we developed an approach tailored to ensuring that the true inputs and their adversarial examples exhibit similar semantics by restricting the search for adversarial examples to the manifold of the true inputs. Our success rates against certified and non-certified defenses known to be resilient to traditional adversarial attacks also illustrate the effectiveness of our approach in generating strong adversarial examples.
-  Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018.
-  Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 8322–8333, 2018.
-  Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.
-  Chen Liu, Ryota Tomioka, and Volkan Cevher. On certifying non-uniform bound against adversarial attacks. CoRR, abs/1903.06603, 2019.
-  Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
-  Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. CoRR, abs/1801.09344, 2018.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.
-  J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. CoRR, abs/1711.00851, 2017.
-  Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
-  Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
-  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
-  Marc R Roussel. Invariant manifolds. In Nonlinear Dynamics, 2053-2571, pages 6–1 to 6–20. Morgan & Claypool Publishers, 2019.
-  Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.
-  Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. CoRR, abs/1707.07397, 2017.
-  Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. VAE learning via Stein variational gradient descent. Neural Information Processing Systems (NIPS), 2017.
-  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, 2018.
-  Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Neural Information Processing Systems (NIPS), 2016.
-  Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. arXiv e-prints, page arXiv:1505.05770, May 2015.
-  Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017.
-  Jun Han and Qiang Liu. Stein variational adaptive importance sampling. Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pages 506–519, New York, NY, USA, 2017. ACM.
-  Bing Yu, Jingfeng Wu, and Zhanxing Zhu. Tangent-normal adversarial regularization for semi-supervised learning. CoRR, abs/1808.06088, 2018.
-  Luca Falorsi, Pim de Haan, Tim R. Davidson, Nicola De Cao, Maurice Weiler, Patrick Forré, and Taco S. Cohen. Explorations in Homeomorphic Variational Auto-Encoding. arXiv e-prints, page arXiv:1807.04689, Jul 2018.
-  Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early Visual Concept Learning with Unsupervised Deep Learning. arXiv e-prints, page arXiv:1606.05579, Jun 2016.
-  Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. An information-theoretic analysis of deep latent-variable models. CoRR, abs/1711.00464, 2017.
-  Liqun Chen, Shuyang Dai, Yunchen Pu, Chunyuan Li, Qinliang Su, and Lawrence Carin. Symmetric Variational Autoencoder and Connections to Adversarial Learning. arXiv e-prints, page arXiv:1709.01846, Sep 2017.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. CoRR, abs/1610.08401, 2016.
-  Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. Adversarially regularized autoencoders. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5902–5911, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. CoRR, abs/1707.07328, 2017.
-  David Alvarez-Melis and Tommi S. Jaakkola. A causal framework for explaining the predictions of black-box sequence-to-sequence models. CoRR, abs/1707.01943, 2017.
-  Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. CoRR, abs/1612.08220, 2016.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, December 1989.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. Proceedings of International Conference on Computer Vision (ICCV), December 2015.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
-  Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326, 2015.
-  Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2642–2651, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
-  Alec Radford. Improving language understanding by generative pre-training. In arXiv, 2018.
-  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a nash equilibrium. CoRR, abs/1706.08500, 2017.
Appendix A: Discussion & Additional Results
Discussion: Here, we discuss some of the choices pertaining to the design of our approach, and our pilot study to manually evaluate adversarial examples we generate. We also discuss the limitations of our method and argue why a fair side-by-side comparison with [2, 3] is impossible. Finally, via empirical validation using toy data, we show that our adversarial examples are manifold-preserving.
Space/Time Complexity. As noted in , the Gaussian prior assumption in VAEs might be too restrictive to generate sufficiently meaningful latent codes . To relax this assumption and produce informative and diverse latent codes, we used SVGD. To generate manifold-preserving adversarial examples, we proposed GBSM. Both SVGD and GBSM maintain a set of model instances. As ensemble methods, both inherit the advantages of ensemble models but also their shortcomings, most notably in space/time complexity. Thus, instead of maintaining model instances, we maintain only and from which we sample these model instances. We experimented with set to and . As increases, we notice some increase in sample quality at the expense of longer runtimes. One way to alleviate the computational overhead during training is to enforce weight-sharing in the lower layers, both for the particles and for the model instances . However, the overhead incurred as takes on larger values drops drastically during inference, since we only need to sample the model instances in order to generate adversarial examples.
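For readers unfamiliar with SVGD, the following is a minimal sketch of the generic particle update of Liu and Wang on a toy 2D Gaussian target. It is not the training procedure used in this paper: the RBF kernel with median-heuristic bandwidth, the fixed step size, the toy target, and the names `svgd_step` and `run_toy_svgd` are all assumptions made for illustration. It does show why the cost grows with the number of particles, since every update evaluates an n-by-n kernel matrix.

```python
import numpy as np


def svgd_step(particles, grad_log_p, step_size=0.1):
    """One generic SVGD update (RBF kernel, median-heuristic bandwidth)."""
    n = particles.shape[0]
    # diffs[i, j] = x_i - x_j, shape (n, n, d)
    diffs = particles[:, None, :] - particles[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    bandwidth = np.median(sq_dists) / np.log(n + 1.0) + 1e-8
    kernel = np.exp(-sq_dists / bandwidth)  # symmetric, k[i, j] = k(x_i, x_j)
    # Attractive term: kernel-weighted sum of the scores of all particles.
    attract = kernel @ grad_log_p(particles)
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = (2/h) sum_j k_ij (x_i - x_j).
    repulse = (2.0 / bandwidth) * np.sum(kernel[:, :, None] * diffs, axis=1)
    return particles + step_size * (attract + repulse) / n


def run_toy_svgd(n_particles=50, n_steps=1000, seed=0):
    """Move particles toward a unit-variance Gaussian centred at (3, 3) (toy target)."""
    rng = np.random.default_rng(seed)
    target_mean = np.array([3.0, 3.0])

    def score(x):
        # Gradient of log N(x; target_mean, I) with respect to x.
        return -(x - target_mean)

    particles = rng.normal(size=(n_particles, 2))
    for _ in range(n_steps):
        particles = svgd_step(particles, score)
    return particles, target_mean
```

The repulsive term is what keeps the particles diverse instead of collapsing onto the mode, which is the property relied upon above to obtain diverse latent codes.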
Preserving Textual Meaning. To generate text adversaries, we experimented with three architecture designs for the decoder : (i.) a transpose CNN, (ii.) a language model, and (iii.) the decoder of a pre-trained ARAE  model. The transpose CNN generates more legible text than the other two designs, although we sometimes notice changes in meaning in the generated adversarial examples. Adversarial text generation is challenging in that small perturbations in the latent codes can go unnoticed at generation, whereas high noise levels can render the outputs nonsensical or uninformative. To produce text adversaries that faithfully preserve the meaning of their clean counterparts, we need powerful sentence generators (e.g. BERT  or GPT ) trained on large corpora. Training such large language models, however, requires time and computational resources. Furthermore, in our experiments, we only considered a vocabulary of 10,000 words and sentences of no more than 10 words, to align our evaluation with the experimental choices of .
Measuring Perceptual Quality is desirable when the method used to generate adversarial examples relies on GANs or VAEs, both of which are known to often produce samples of limited quality. From the results of [2, 3], which we illustrate in Figure 24 on page 24, we can notice that some of their adversarial images are not legitimate, owing to the mode collapse that GANs suffer from. VAEs also have their own shortcomings, in that the Gaussian prior might be too restrictive  to produce sufficiently meaningful latent codes . By using SVGD and GBSM, however, we alleviate this risk since we are not optimizing the ELBO explicitly. Moreover, the sampling of clean and perturbed latent codes is conditioned on the true inputs, which makes the codes informative about the inputs. To effectively measure the quality of our adversarial examples, it therefore seems relevant to rely on human evaluation. Thus, for validation purposes, we carry out a pilot study, which we detail below. However, a fair side-by-side comparison of our results with [2, 3], either manually or using metrics like mutual information or Fréchet inception distance , seems impossible, as  or  either perform unrestricted targeted attacks (their adversarial examples might totally differ from the true inputs) or do not target any defenses.
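For reference, the Fréchet distance underlying the Fréchet inception distance compares two sets of feature vectors through their means and covariances: d² = ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The sketch below is illustrative only and is not used in our evaluation. It assumes the features (Inception activations in the usual setting) have already been extracted into NumPy arrays, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Because the score only summarizes two distributions of samples, it says nothing about whether an individual adversarial example stays close to its own true input, which is one reason a per-example human evaluation is preferred here.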
Human Evaluation: We carry out a pilot study to evaluate the coherence of the generated adversarial examples by asking three yes-or-no questions: (Q1) are the adversarial examples semantically sound?, (Q2) are the true inputs similar perceptually or in meaning to their adversarial counterparts? and (Q3) are there any interpretable visual cues in the adversarial images that support their misclassification? For all the datasets, we randomly select a number of samples, generate their adversarial examples and record their classes, and present a questionnaire to 10 human subjects for evaluation.
For MNIST, we pick 50 images (5 for each digit) and generate adversarial examples against a 40-step PGD ResNet model with and the certified defenses  and  with . We do the same for SVHN, except that we only attack the 40-step PGD ResNet model. For CelebA, we also randomly pick 50 images (25 for each gender: male or female) and generate adversarial examples against the 40-step PGD ResNet model. We carry out a similar pilot study for the SNLI dataset by considering only adversarial examples generated when the decoder is a transpose CNN. We randomly select 20 pairs of sentences (premise and hypothesis) and generate adversarial hypotheses for each sentence pair, with the premise sentence kept unchanged. Next, we present the results of this pilot study along with samples of the adversarial examples we generate, and qualitatively compare our samples with [2, 3].
Adversarial Images: CelebA
Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the CelebA dataset. We also provide the results of the pilot study.
|True inputs||Adversarial Examples|
|Questionnaire||Against 40-steps PGD - Adversarial CelebA Images (%)|
Question 1 (Q1): Yes
|Question 2 (Q2): Yes||100|
|Question 3 (Q3): No||100|
Adversarial Images: SVHN
Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the SVHN dataset. We also provide the results of the pilot study.
|True inputs||Adversarial Examples|
|Questionnaire||Against 40-steps PGD - Adversarial SVHN Images (%)|
Question 1 (Q1): Yes
|Question 2 (Q2): Yes||100|
|Question 3 (Q3): No||100|
Some adversarial images and clean ones were difficult to evaluate due to blurriness reported in both.
Adversarial Images: MNIST
Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the MNIST dataset. We also provide the results of the pilot study.
|True inputs||Adversarial Examples|
Adversarial Text: SNLI
Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the SNLI dataset. We also provide the results of the pilot study.
|True Input 1||P: A biker races.|
|H: A person is riding a bike.|
|Adversary||H: A man races. Label: Contradiction|
|True Input 2||P: The girls walk down the street.|
|H: Girls walk down the street.|
|Adversary||H: A choir walks down the street. Label: Neutral|
|True Input 3||P: Two wrestlers in an intense match.|
|H: Two wrestlers are competing and are brothers.|
|Adversary||H: Two women are playing together. Label: Contradiction|
|True Input 4||P: A group of people celebrate their asian culture.|
|H: A group of people celebrate.|
|Adversary||H: A group of people cruising in the water. Label: Neutral|
|True Input 5||P: Cheerleaders standing on a football field.|
|H: Cheerleaders are wearing outside.|
|Adversary||H: Person standing on a playing field. Label: Neutral|
|True Input 6||P: People are enjoying food at a crowded restaurant.|
|H: People are eating in this picture.|
|Adversary||H: People are enjoying looking at a crowded restaurant.|
|True Input 7||P: A wrestler celebrating his victory.|
|H: A wrestler won a championship.|
|Adversary||H: A rider won the championship.|
|Questionnaire||Adversarial Hypotheses (SNLI)|
Question 1 (Q1): Yes
|Question 2 (Q2): Yes||75 %|
Manifold Preservation: Illustration with Synthetic Data
To validate empirically that the adversarial examples we generate reside in the manifold of their original counterparts, we evaluate our approach on the 3D non-linear Swiss Roll dataset, which comprises 1600 datapoints grouped in 4 classes. We illustrate in Figure 19.a the 2D plots of the manifold learned by our approach before (left) and after (right) perturbing its elements. The points in orange refer to the latent codes (of all the points in blue) whose perturbed versions (in green) induced adversariality. In Figure 19.a, we can see that the topological structure of the manifold is well preserved, with the exception of a few perturbed points lying slightly outside the boundary delineated by the blue points. As a result, we can see in Figure 19.c and Figure 19.d how close the adversarial examples are to their clean counterparts.
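As a rough illustration of this kind of toy data, the sketch below generates a 3D Swiss Roll with 1600 points and 4 classes using the usual spiral parameterisation. The class assignment, which splits the points into four equally sized segments along the roll, is an assumption, as are the function name `make_swiss_roll_classes`, the noise level and the seed; the text does not specify how the classes were defined.

```python
import numpy as np


def make_swiss_roll_classes(n_points=1600, n_classes=4, noise=0.0, seed=0):
    """Sample a 3D Swiss Roll and label points by position along the roll (assumed)."""
    rng = np.random.default_rng(seed)
    # Roll parameter t controls the position along the spiral.
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n_points))
    height = 21.0 * rng.random(n_points)
    points = np.stack([t * np.cos(t), height, t * np.sin(t)], axis=1)
    points = points + noise * rng.normal(size=points.shape)
    # Assumed labelling: equal-count bins over the sorted roll parameter.
    order = np.argsort(t)
    labels = np.empty(n_points, dtype=int)
    labels[order] = np.arange(n_points) * n_classes // n_points
    return points, labels
```

With the defaults, `make_swiss_roll_classes()` returns an array of shape (1600, 3) and 400 points per class. Labelling along the roll parameter means neighbouring classes touch only along the intrinsic direction of the manifold, so a perturbation that changes the class while leaving the manifold is easy to spot in a 2D plot.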
As argued in the Discussion section, a fair side-by-side comparison of our results with those of [2, 3] is not possible, for the reasons discussed there. For illustration purposes only, and to showcase the differences in sampling quality between our approach and  and , we show below some adversarial examples they generate. Unlike ours, some of their images are not valid (see N/A marks). Also in contrast to ours, a few of their adversarial examples, although semantically sound, fail to preserve the structure of the true inputs; for instance, in Figure 24.b row 1, we can notice the changes in structure of source class 1. The same holds for the classes 7, 8, 0 in Figure 24.d row 4.
Appendix B: Detailed Experimental Settings
|Initial Layer||conv. maps. stride.||_|
Residual Block 1
|batch normalization, leaky relu conv. maps stride batch normalization, leaky relu conv. maps. stride residual addition|
Residual Block 1
|batch normalization, leaky relu conv. maps stride batch normalization, leaky relu conv. maps. stride average pooling, padding|
Residual Block 2
|batch normalization, leaky relu conv. maps stride batch normalization, leaky relu conv. maps. stride average pooling, padding|
Residual Block 2
|batch normalization, leaky relu conv. maps stride batch normalization, leaky relu conv. maps. stride average pooling, padding||_|
Residual Block 3
|batch normalization, leaky relu conv. maps stride batch normalization, leaky relu conv. maps. stride average pooling, padding|
|conv. maps. stride.||_|
|conv. maps. stride.||_|
|Recognition Networks||Input Dim: 50, Hidden Layers: [60, 70], Output Dim: Num weights & biases in|
|Input Dim: 50, Hidden Layers: [60, 70], Output Dim: Num weights & biases in|
|Particles||Input Dim: (MNIST), (CelebA), (SVHN), 300 (SNLI) Hidden Layers: [40, 40] Output Dim (latent code): 100|
|Parameters||Input Dim: (MNIST), (CelebA), (SVHN), 100 (SNLI) Hidden Layers: [40, 40] Output Dim (latent code): 100|
|Input Dim: (MNIST), (CelebA), (SVHN), (SNLI) Hidden Layers: [40, 40] Output Dim: (MNIST), (CelebA), (SVHN), (SNLI)|
|Transpose CNN||For CelebA & SVHN: [filters: 64, stride: 2, kernel: 5] 3 For SNLI: [filters: 64, stride: 1, kernel: 5] 3|
|Language Model||Vocabulary Size: 11,000 words Max Sentence Length: 10 words|
|Input Dim: 200, Hidden Layers: [100, 100, 100], Output Dim: 3|
|Learning Rates||Adam Optimizer|
Batch size: 64, Inner-updates: 3, Training epochs: 500,