Manifold Preserving Adversarial Learning

by   Ousmane Amadou Dia, et al.
Element AI Inc

How can we generate semantically meaningful and structurally sound adversarial examples? We propose to answer this question by restricting the search for adversaries to the true data manifold. To this end, we introduce a stochastic variational inference method to learn the data manifold, in the presence of continuous latent variables with intractable posterior distributions, without requiring an a priori form for the underlying data distribution. We then propose a manifold perturbation strategy that ensures the cases we perturb remain in the manifold of the original examples, and thereby generate the adversaries. We evaluate our approach on a number of image and text datasets. Our results show the effectiveness of our approach in producing coherent and realistic-looking adversaries that can evade strong defenses known to be resilient to traditional adversarial attacks.





1 Introduction

Recent developments in adversarial machine learning DBLP:journals/corr/abs-1802-00420 ; DBLP:conf/nips/SongSKE18 ; zhao2018generating have cast serious doubts on the robustness of deep learning models. Although many defense mechanisms DBLP:journals/corr/abs-1903-06603 ; sinha2018certifiable ; DBLP:journals/corr/abs-1801-09344 ; DBLP:journals/corr/MadryMSTV17 ; DBLP:journals/corr/abs-1711-00851 have been proposed to alleviate the security risks faced by these models, very few are resilient to attacks DBLP:journals/corr/abs-1802-00420 ; DBLP:journals/corr/CarliniW16a . Adversarial attacks manipulate data with imperceptible noise so as to cause a classifier to make incorrect predictions 42503 . In the image domain, adversarial examples are generated to look identical to real images, although the precise locations of fine details in the images might still not be preserved. In the text domain, even slight changes to a sentence can alter its readability or corrupt its meaning zhao2018generating . For adversarial samples to be coherent, they need to be valid and ideally convey the meaning of the true inputs. In the text domain especially, they need to be grammatically and linguistically sound. We posit that existing approaches fail to enforce such requirements because the semantics of the true inputs and the adversarial examples, captured in their respective low-dimensional dense representations or manifolds, do not necessarily align. This may result from a poor characterization of the manifold of the true inputs due to rigid assumptions about its structure, such as confining the latent variable model of the manifold to be Gaussian DBLP:conf/nips/SongSKE18 ; zhao2018generating , or because the search for adversarial examples is restricted to uniformly-bounded regions or conducted along suboptimal gradient directions 42503 ; DBLP:journals/corr/KurakinGB16 ; ianj.goodfellow2014 .

In this study, we explore a novel approach to generate coherent adversarial examples in the image and text domains. Our approach is built around the manifold invariance theorem 10.1088/2053-2571/ab0281ch6 and seeks to align the manifolds of the adversarial examples and the true inputs so that the adversarial examples may reflect the semantics of the true inputs. To intuitively introduce our approach, we decompose the adversarial learning problem into: (i.) manifold learning, where we develop a variational inference technique to encode high-dimensional data into a low-dimensional dense representation without needing to reparameterize the encoder network kingma2014 , and (ii.) perturbation invariance under the manifold invariance concept, where we describe a simple yet efficient perturbation strategy that ensures cases we perturb remain in the manifold of the true inputs. We subsequently illustrate how the rich latent structure exposed in the manifold can be leveraged as a source of knowledge upon which adversarial constraints can be imposed, while learning the manifold and without resorting to exhaustive search as per common practice DBLP:conf/nips/SongSKE18 ; zhao2018generating ; ianj.goodfellow2014 . Finally, we apply our approach to images and text data in a black-box setting to generate adversarial examples that are coherent and perceptually similar to the true inputs.

The main contributions of our work are thus: (i.) a novel variational inference method for manifold learning in the presence of continuous latent variables with minimal assumptions about their distribution, (ii.) an intuitive perturbation strategy that encourages perturbed elements of a manifold to remain within the manifold, (iii.) an end-to-end pipeline that combines (i.) and (ii.) to generate adversarial examples that follow the same distribution as the inputs, and (iv.) illustration on images and text, as well as empirical validation against strong certified and non-certified adversarial defenses.

2 Problem Setup & Architecture

Problem Setup. Consider a dataset of training examples, their corresponding classes, and a black-box classifier. We want to generate adversarial examples that are coherent. By “coherent”, we mean given an example , we want its adversarial example to come from a distribution similar to of . In particular, we require to be the nearest such instance to in the manifold that defines and subject to adversariality. To produce coherent adversarial examples, we devise our approach around the invariant manifold concept defined in 10.1088/2053-2571/ab0281ch6 which we revisit below.

Definition 2.1.

Manifold Invariance 10.1088/2053-2571/ab0281ch6 . Let be a metric space and a homeomorphism. For any point , a neighborhood of is invariant under if: .

Definition 2.1 stipulates that the topology of is preserved under as any projection of remains in . To see how Definition 2.1 can be enforced in an adversarial setting, we confine for now to an -radius ball around and let be the class of . If there exists a point s.t and , then is adversarial to  athalye2017 and is neighborhood-preserving.
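The condition in the paragraph above (a point inside the bounded ball around the input whose predicted class differs from the input's) can be illustrated with a minimal sketch. The L-infinity distance, the radius value, and the toy classifier below are all assumptions made for illustration; the text does not fix the norm or the classifier.

```python
from typing import Callable, Sequence

def linf_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest coordinate-wise gap between two points (assumed norm)."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def is_adversarial_in_ball(x, x_adv, classify: Callable, eps: float) -> bool:
    """True when x_adv stays in the eps-ball around x and changes the predicted class."""
    inside_ball = linf_distance(x, x_adv) <= eps
    label_changed = classify(x_adv) != classify(x)
    return inside_ball and label_changed

# Hypothetical toy classifier: class 1 if the coordinates sum to a positive value.
def toy_classifier(x):
    return int(sum(x) > 0)

x = [0.05, -0.02]
near_flip = is_adversarial_in_ball(x, [-0.04, -0.02], toy_classifier, eps=0.1)
far_flip = is_adversarial_in_ball(x, [-0.50, -0.02], toy_classifier, eps=0.1)
print(near_flip, far_flip)
```

The second candidate also flips the label but leaves the ball, so it does not count as neighborhood-preserving under this check.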

In this paper, we mainly operate in the low embedding space of we denote . Earlier, we restricted the search region for the adversarial examples of to the uniformly-bounded ball . In reality, the search region may not be as well-defined and may even be difficult to characterize especially in the dense regions of . Thus, to adapt Definition 2.1 to , we consider instead two embedding maps and parameterized respectively by and , and use to find points that lie in the vicinity of in such that when mapped back onto , the input space of , they induce adversariality. We assume here that and follow the implicit distributions and (treating and as random variables, as in kingma2014 , allows for model averaging yuchen2017 , which we leverage).

Definition 2.2.

Adversarial Learning Objective. Let be metric spaces and a mapping function. We want to learn , , and s.t the divergence is arbitrarily small, and for any , is also arbitrarily small and .

Model Architecture. To attain our adversarial learning objective, we propose as a framework the architecture illustrated in Figure 1 that we design according to Definition 2.2. Our framework is essentially a variational auto-encoder with two encoders that learn to approximate the variational posterior over instead of its reparameterized form kingma2014 . To succeed in this task, we introduce two inference mechanisms (implicit manifold learning via Stein variational inference and the Gram-Schmidt basis sign method) to draw instances of model parameters from the implicit distributions and that we parameterize the two encoders with. Both encoders optimize the uncertainty inherent to embedding while guaranteeing easy sampling via Bayesian ensembling. The decoder acts as a generative model for crafting adversarial examples and a proxy for creating latent targets in the space in order to optimize the first encoder; a process we refer to as inversion and depict in Figure 2.

Unlike most perturbation-based approaches athalye2017 ; Shaham2018UnderstandingAT that search for adversarial examples in the input space of , we leverage the dense representation of the latent space of ; the intuition being here that the latent space captures well the semantics of . Thus, rather than finding the adversarial example of a given input in , we learn to perturb its latent code in a way that the perturbed version and lie in the same manifold. Then, we construct using our decoder . By learning to efficiently perturb the latent codes, and then mapping these low-dimensional representations back onto the input space, we control the perturbations we inject to the adversarial examples thereby ensuring that they are perceptually similar to the inputs. In the following, we refer to the manifold of as the set of locally Euclidean latent representations (or points) 10.1088/2053-2571/ab0281ch6 of instances of .


Figure 1: Architecture. The model instances and are generated from the networks and and used to sample and given an input , for any . is generated from posterior sampling (in red) of a , via Bayesian ensembling, that is passed to the decoder .







Figure 2: Inversion. During an inner update, noise is fed to to generate . Given , we sample and reconstruct to . Then, using we sample again . Now, becomes the target of . At the next update, becomes the prediction, and we repeat this process to create a target for .

3 Technical Background

Manifold Learning. To uncover structure in some high-dimensional data $\mathcal{D}$ and understand its meta-properties, it is common practice to map $\mathcal{D}$ to a low-dimensional subspace where explanatory hidden features may become apparent. Manifold learning is based on the assumption that the data of interest lies on or near lower-dimensional manifolds in its embedding space. In the variational auto-encoder (VAE) kingma2014 setting upon which we build our work, the datapoints $x \in \mathcal{D}$ are modeled via a decoder $p(x \mid z; \phi)$ with a prior $p(z)$ placed on the latent codes $z$. To learn the parameters $\phi$, one typically maximizes a variational approximation to the empirical expected log-likelihood over the $N$ datapoints, $\frac{1}{N}\sum_{n=1}^{N} \log p(x_n; \phi)$, called evidence lower bound (ELBO) and defined by:

$$\mathcal{L}(\phi, \psi; x) = \mathbb{E}_{q(z \mid x; \psi)}\big[\log p(x \mid z; \phi)\big] - \mathrm{KL}\big(q(z \mid x; \psi) \,\|\, p(z)\big) \qquad (1)$$


To evaluate the objective $\mathcal{L}$, we require the ability to sample efficiently from $q(z \mid x; \psi)$, the encoder distribution which approximates the posterior inference $p(z \mid x; \phi)$. Specifically, we need a closed and differentiable form for $q(z \mid x; \psi)$ in order to evaluate the expectation. The reparameterization trick kingma2014 provides a simple way to rewrite the expectation $\mathbb{E}_{q(z \mid x; \psi)}[\cdot]$ such that its Monte Carlo estimate is differentiable w.r.t. $\psi$. More formally, under some mild differentiability conditions kingma2014 , the random variable $z \sim q(z \mid x; \psi)$ can be reparameterized using a differentiable transformation $g_{\psi}(\epsilon, x)$ of an auxiliary noise variable $\epsilon$: $z = g_{\psi}(\epsilon, x)$, where the prior is generally confined to a family of simple and tractable distributions like a Gaussian distribution.
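To make the reparameterization step concrete, here is a minimal sketch for the common diagonal-Gaussian case. The function names and the choice of a standard normal prior are illustrative assumptions; the block shows the standard VAE construction that this paper sets out to avoid, not the paper's own method.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), so z is a deterministic,
    differentiable function of (mu, log_var) once eps is fixed."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer of the ELBO."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]
z = reparameterize(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)
print(len(z), round(kl, 3))
```

The closed form for the KL term exists only because both distributions are Gaussian, which is exactly the restriction discussed above.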

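Stein Variational Gradient Descent (SVGD) qiangliu2016 is a nonparametric variational inference method that combines the advantages of MCMC sampling and variational inference. Unlike variational Bayes in VAEs kingma2014 , SVGD does not confine the target distributions it approximates to simple or tractable parametric distributions while still remaining efficient. To approximate the target distribution $p(\theta)$, SVGD maintains $M$ particles $\Theta = \{\theta_m\}_{m=1}^{M}$, initially sampled from a simple distribution, that it iteratively transports via functional gradient descent. Henceforward, we shall consider these particles as instances of model parameters. At iteration $t$, each particle $\theta_m^{t}$ is updated as follows:

$$\theta_m^{t+1} \leftarrow \theta_m^{t} + \alpha_t \, \tau(\theta_m^{t}), \qquad \tau(\theta) = \frac{1}{M} \sum_{j=1}^{M} \Big[ k(\theta_j^{t}, \theta) \, \nabla_{\theta_j^{t}} \log p(\theta_j^{t}) + \nabla_{\theta_j^{t}} k(\theta_j^{t}, \theta) \Big] \qquad (2)$$

where $\alpha_t$ is a step-size and $k(\cdot, \cdot)$ is a positive-definite kernel. In eq. (2), each particle determines its update direction by consulting with other particles and asking their gradients. The importance of the latter particles is weighted according to the distance measure $k(\cdot, \cdot)$. Closer particles are given higher consideration than those lying further away. The term $\nabla_{\theta_j} k(\theta_j, \theta)$ is a regularizer that acts as a repulsive force between the particles to prevent them from collapsing into one particle. The resulting particles can thus be used to approximate the predictive posterior distribution over the latent codes $z$:

$$p(z \mid x, \mathcal{D}) = \int p(z \mid x, \theta) \, p(\theta \mid \mathcal{D}) \, d\theta \approx \frac{1}{M} \sum_{m=1}^{M} p(z \mid x, \theta_m), \qquad \theta_m \sim p(\theta \mid \mathcal{D}) \qquad (3)$$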
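As an illustration of the SVGD update described above, the following is a minimal one-dimensional sketch. It assumes an RBF kernel with a fixed bandwidth and a toy Gaussian target with mean 2 and unit variance; the particle count, step size, and bandwidth are hypothetical choices, not values from the paper.

```python
import math
import random

def rbf_kernel(a, b, bandwidth):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def svgd_step(particles, grad_log_p, step_size, bandwidth=1.0):
    """One SVGD update: kernel-weighted gradients (attraction) plus kernel gradients (repulsion)."""
    m = len(particles)
    updated = []
    for theta in particles:
        direction = 0.0
        for theta_j in particles:
            k = rbf_kernel(theta_j, theta, bandwidth)
            # Derivative of k(theta_j, theta) w.r.t. theta_j; it pushes theta away from theta_j.
            grad_k = -(theta_j - theta) / bandwidth ** 2 * k
            direction += k * grad_log_p(theta_j) + grad_k
        updated.append(theta + step_size * direction / m)
    return updated

def grad_log_target(theta):
    """Score of the assumed toy target N(2, 1)."""
    return -(theta - 2.0)

rng = random.Random(0)
particles = [rng.gauss(-3.0, 1.0) for _ in range(20)]  # start far from the target
for _ in range(500):
    particles = svgd_step(particles, grad_log_target, step_size=0.1)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
print(f"mean={mean:.2f}, spread={spread:.2f}")
```

The attraction term moves the particles toward the high-density region around 2, while the repulsion term keeps them from collapsing onto a single point, which is why the spread stays well above zero.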


4 Proposed Method

We first describe a novel variational inference method to learn the manifold of . Then, we introduce a technique to perturb the elements of while ensuring is invariant to the perturbations. Finally, we describe our end-to-end process for generating manifold preserving adversarial examples.

4.1 Implicit Manifold Learning

As described in section 3, it is customary to optimize the ELBO in eq. (1) when training VAEs due to the intractability of the data-likelihood. Moreover, we need a closed and differentiable form for the encoder . This is achieved by reparameterizing the encoder using a differentiable transformation of some auxiliary Gaussian noise variable. Constraining the noise to be Gaussian is however too restrictive 2015arXiv150505770J and may lead to learning poorly the manifold of the data DBLP:journals/corr/ZhaoSE17b . To alleviate this issue, one can minimize the divergence as in yuchen2017 instead of optimizing the ELBO explicitly. In yuchen2017 , for an input , the authors draw latent codes using a recognition network and update them via SVGD. Similar to yuchen2017 , we focus on the term and use a recognition network to sample model instances we denote particles. Unlike yuchen2017 , however, we generate the latent codes using instead of applying dropout noise on the recognition network.

We aim to leverage to map to some manifold in the latent space . This encoding inherently induces uncertainty we ought to capture in order to learn efficiently. Bayesian methods provide a principled way to model uncertainty through the posterior distribution over model parameters. Thus, we let every particle define the weights and biases of a Bayesian neural network. For a large number of particles, maintaining them can be computationally prohibitive because of the memory footprint. Furthermore, the need to generate the particles during inference for each test example is undesirable. To remedy this, we train a recognition network parameterized by that takes as input some noise and outputs . The parameters are updated through a small number of gradient steps in order to produce good generalization performance. If we let be the parameters of at iteration , and given which we defined in eq. (2), we get by:


In the following, we shall use SVGD) to denote an SVGD update of using . To apply SVGD, we need to evaluate which no longer depends on the prior but on the posterior (eq. 3) as is observed. Computing this posterior requires having a target for . Hence, we use our decoder to create a target for by first sampling , then reconstructing , and then re-sampling the target . We summarize this procedure in Figure 2.

As in yuchen2017 , plays a role analogous to in eq. (1). It gives us means to sample latent codes from a particle given an input without imposing an explicit and functional parametric form for . Unlike  yuchen2017 , we do not apply SVGD on the latent codes but on the particles that generate the codes. By also being Bayesian, every particle provides some prior information on how to sample these codes. During training and inference, we use Bayesian model averaging to get a latent code from ; formally, if , for instance, then where the posterior is given in eq. (3).
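The model-averaging step mentioned above can be sketched as follows. This is a simplification under stated assumptions: each particle is taken to be a hypothetical linear encoder given by a (weights, bias) pair, and the average is taken over point predictions rather than over full predictive distributions.

```python
def encode(theta, x):
    """Hypothetical linear encoder: theta = (weights, bias) maps input x to a latent code."""
    weights, bias = theta
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def model_average(particles, x):
    """Monte Carlo average of the per-particle latent codes (the 1/M sum of a predictive posterior)."""
    codes = [encode(theta, x) for theta in particles]
    dim = len(codes[0])
    return [sum(code[d] for code in codes) / len(codes) for d in range(dim)]

# Two illustrative particles, each defining a different encoder.
particles = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[0.0, 1.0], [1.0, 0.0]], [1.0, 1.0]),
]
z = model_average(particles, [2.0, 4.0])
print(z)
```

With more particles, the average becomes a better Monte Carlo estimate of the posterior mean of the latent code, at the cost of one encoder pass per particle.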

4.2 Perturbation Invariance

Previously, we described a method to encode in its manifold . Here, we focus on learning to perturb the elements of . We require the perturbed elements to reside in so that they exhibit the features of . Intuitively, we seek a linear mapping such that if :  10.1088/2053-2571/ab0281ch6 . Linear spans, for instance, satisfy this condition. We say then that is -invariant or preserved under . Rather than finding directly, we introduce a new set of model instances and let every denote the weights and biases of a Bayesian neural network. Then, if and is its latent code with in eq.(3), we can set

We leverage the local smoothness of to learn each in a way to encourage to reside in in a close neighborhood of using a novel technique called Gram-Schmidt Basis Sign Method.

Gram-Schmidt Basis Sign Method (GBSM). Let be a minibatch of samples of . For , let be a set of latent codes where and . We aim to learn in a way to generate perturbed versions of along the directions of an orthonormal basis . Given that is locally Euclidean, we can compute the dimensions of its subspace by applying Gram-Schmidt to orthogonalize the span of representative local points. We formalize this objective as an optimization problem where we jointly learn the perturbations , the directions along which we should perturb , and . Formally, we minimize the objective :


To describe the minimization process in eq. (5) in more detail, we first sample a model instance given . Then, we generate the codes for all . We orthogonalize and find the noise tensor that minimizes along the directions of the basis vectors . The goal is to get the values of to be as small as possible. With fixed, we update by minimizing again.
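The orthogonalization step in this procedure can be sketched with a plain Gram-Schmidt routine, followed by a perturbation along the resulting basis directions. This is only an illustration under assumptions: the latent codes, the tolerance, and the perturbation magnitudes are hypothetical, and the learned noise and the objective of eq. (5) are not modeled.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize the vectors, dropping those that are numerically linearly dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            proj = dot(w, u)  # project the running residual (modified Gram-Schmidt)
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def perturb_along_basis(z, basis, deltas):
    """Shift latent code z by deltas[i] along the i-th basis vector."""
    out = list(z)
    for u, d in zip(basis, deltas):
        out = [oi + d * ui for oi, ui in zip(out, u)]
    return out

# Hypothetical latent codes; the third is the sum of the first two, so the span is 2-D.
codes = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0]]
basis = gram_schmidt(codes)
z_perturbed = perturb_along_basis(codes[0], basis, deltas=[0.1, 0.0])
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_perturbed, codes[0])))
print(len(basis), round(shift, 6))
```

Because the perturbation is a combination of basis vectors spanning the local codes, the perturbed code stays in the same linear span, which is the intuition behind keeping perturbed points in the locally Euclidean manifold.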

Maintaining for a large can however be computationally prohibitive. Thus, as in section 4.1, we use a recognition network that takes as input some noise and generates . Here also, using eq. (5), we update through a small number of gradient steps.


In the following, we use the notation GBSM where to update at once.

Distribution Alignment. Although GBSM confers us clear benefits in the forms of latent noise imperceptibility and sampling speed, may deviate from in which case the manifolds they learn may differ. This understandably contradicts the requirement in Definition 2.2 that the divergence between the implicit distributions and , hence between their samples and , needs to be small; the intuition being to get the adversarial examples to reflect the semantics of the inputs by aligning their manifolds. To remedy this, we further regularize each after every GBSM update. In essence, we slightly modify SVGD to ensure the model instances follow the transform maps constructed by  junhan2017 at each iteration using the operator :


Unlike in eq. (2), with the parameters determine their own update direction by consulting the particles alone instead of consulting each other. In the following, we shall use SVGD to denote the application of the gradient update rule in eq. (7) using the kernelized operator .
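The idea that one set of parameters is moved using directions built from another set of particles alone can be sketched in one dimension. This block is an assumption-heavy illustration: it reuses the standard Stein direction with an RBF kernel, evaluates it at "follower" points using only "leader" particles, and uses a toy Gaussian target; it is not a reproduction of the operator in eq. (7).

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def stein_direction(leaders, point, grad_log_p, bandwidth=1.0):
    """Stein update direction at `point`, built from the leader particles only."""
    total = 0.0
    for theta in leaders:
        k = rbf_kernel(theta, point, bandwidth)
        grad_k = -(theta - point) / bandwidth ** 2 * k
        total += k * grad_log_p(theta) + grad_k
    return total / len(leaders)

def grad_log_target(theta):
    """Score of the assumed toy target N(2, 1)."""
    return -(theta - 2.0)

leaders = [-4.0 + 0.1 * i for i in range(21)]     # stand-in for the first set of particles
followers = [-3.5 + 0.25 * i for i in range(5)]   # stand-in for the second set of model instances
step_size = 0.05
for _ in range(800):
    # Both sets are moved with directions computed from the same snapshot of the leaders.
    new_leaders = [t + step_size * stein_direction(leaders, t, grad_log_target) for t in leaders]
    followers = [t + step_size * stein_direction(leaders, t, grad_log_target) for t in followers]
    leaders = new_leaders

leader_mean = sum(leaders) / len(leaders)
follower_mean = sum(followers) / len(followers)
print(f"leaders={leader_mean:.2f}, followers={follower_mean:.2f}")
```

In this sketch the followers never consult each other, yet they end up in the same region as the leaders because both are transported by the same map, which mirrors the alignment goal stated above.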

4.3 Manifold Preserving Adversarial Learning

In general, one can produce adversarial examples via white-box ianj.goodfellow2014 or black-box Papernot:2017:PBA:3052973.3053009 attacks. In the more standard black-box case, which is the main focus of this study, we only have access to the predictions of a classifier and need to generate adversarial examples without knowing the intricacies of the classifier or having access to its loss function. In this setting, adversarial examples are produced by maximizing an auxiliary loss, here the log-likelihood , of a target class over an $\epsilon$-radius ball around the input athalye2017 . This is formalized as:


Previously, we decomposed the adversarial learning problem into: (i.) manifold learning where we develop a new method to encode high-dimensional data into a low dense representation that faithfully characterizes its topological structure, and (ii.) manifold perturbation where we ensure cases we perturb remain in the manifold of the true inputs. Here, we show how both methods can be unified in one learning procedure to generate adversarial examples without resorting to exhaustive search.

We illustrate this end-to-end procedure in Algorithm 1 and describe it here. First, we sample , the particles that learn to model the distribution of in its manifold , from . Then, we sample , a set of transformations from we want to be invariant to under the perturbations these transformations induce. Both and act as encoders of but more faithfully than . Both also attempt to capture the complex uncertainty structure of by learning to approximate its predictive posterior. To optimize , we have to compute this posterior. We do so after the first inner update by using the inversion mechanism we described in Figure 2 in an iterative fashion. From updating , we optimize using stochastic gradient descent. Then, we apply the objective in eq. (7) to optimize , and subsequently update . This summarizes the inner-training of the recognition networks and . To generate adversarial examples, we extend the objective in eq. (8) as follows:


where is the reconstruction error of that we formalize in section 6. The second term is the cost incurred when failing to fool the classifier and is the strength of the adversarial examples.

  for  to (number of inner updates) do
     Compute and where and
     Given , sample and using and , then sample and
     if  then
         SVGD    # apply the inversion on and update
         GBSM    # update and alternately using GBSM
         SVGD    # alignment of with and update of
     end if
  end for
  Compute loss in eq. (9) on and propagate back its gradient.
Algorithm 1 Manifold preserving adversarial learning. is the distance.

5 Related Work

Manifold Learning. VAEs are generally used to learn manifolds DBLP:journals/corr/abs-1808-06088 ; 2018arXiv180704689F ; higgins2016 by maximizing the ELBO of the data log-likelihood DBLP:journals/corr/abs-1711-00464 ; 2017arXiv170901846C . Optimizing the ELBO requires reparameterizing the encoder to a simple distribution like a Gaussian kingma2014 . The Gaussian prior assumption is however too restrictive 2015arXiv150505770J and may lead to learning the manifold of the data poorly DBLP:journals/corr/ZhaoSE17b . To alleviate this issue, one can instead optimize the divergence between the encoder and the posterior inference as in yuchen2017 . Although our work and yuchen2017 are similar in that both use recognition networks, our approach is more general. In our case, the recognition network generates the particles, which are model instances, from which we can sample arbitrarily many latent codes rather than a finite set of pointwise estimates yuchen2017 . Also, given that our particles are Bayesian, we learn to better capture the uncertainty inherent to encoding data using VAEs.

Adversarial Examples. Studies in adversarial deep learning DBLP:journals/corr/abs-1802-00420 ; DBLP:journals/corr/KurakinGB16 ; ianj.goodfellow2014 ; athalye2017 can be categorized into two groups. The first group DBLP:journals/corr/CarliniW16a ; athalye2017 ; DBLP:journals/corr/Moosavi-Dezfooli16 proposes to generate adversarial examples directly in the input space by distorting, occluding or changing illumination in images to cause changes in classification. The second group DBLP:conf/nips/SongSKE18 ; zhao2018generating , where our work belongs, uses generative models to search for adversarial examples in the dense and continuous representation of the data rather than in its input space.

Adversarial Images. Song et al. DBLP:conf/nips/SongSKE18 propose to construct unrestricted adversarial examples in the image domain by training a conditional GAN that constrains the search region for a latent code in the neighborhood of a target . Zhao et al. zhao2018generating use also a GAN to map input images to a latent space where they conduct their search for adversarial examples. Both studies are closely related to ours. Unlike DBLP:conf/nips/SongSKE18 , however, our adversarial examples are not restricted. Contrary to DBLP:conf/nips/SongSKE18 and zhao2018generating , we do not need to set the noise level to inject to the latent codes of the true inputs as the perturbations are automatically learned. Moreover, in contrast to DBLP:conf/nips/SongSKE18 and zhao2018generating , where the search for adversarial examples is exhaustive and decoupled from the training of the GANs, our approach is end-to-end. Lastly, by capturing the uncertainty induced by the mapping of the data to latent space, we learn to characterize the semantics of the data better; which allows us thus to generate more realistic adversarial examples.

Adversarial Text. Previous studies on adversarial text generation pmlr-v80-zhao18b ; DBLP:journals/corr/JiaL17 ; DBLP:journals/corr/Alvarez-MelisJ17 ; DBLP:journals/corr/LiMJ16a perform word erasures and replacements directly in the input space using domain-specific rules or heuristics, or require manual curation. Similar to us, Zhao et al. zhao2018generating propose to search for textual adversarial examples in the latent representation of the data. However, in addition to the differences mentioned above, the search for adversarial examples is handled more gracefully in our case thanks to an efficient gradient-based optimization method in lieu of a computationally expensive search in the latent space.

6 Experiments & Results

We evaluate our approach on a number of black-box classification tasks of images and text, and illustrate in Appendix A with synthetic data that our adversarial examples are manifold-preserving.

Image Datasets. For image classification, we experiment with three standard datasets: MNIST LeCun:1989:BAH:1351079.1351090 , CelebA liu2015faceattributes and SVHN 37648 . For MNIST and SVHN, each image of a digit represents the class the image belongs to. For CelebA, we group the face images according to their gender (female, male), similar to DBLP:conf/nips/SongSKE18 , and focus on gender classification.

Text Datasets. For text classification, we consider the SNLI DBLP:journals/corr/BowmanAPM15 dataset. SNLI consists of sentence pairs where each pair contains a premise and a hypothesis, and a label indicating the relationship (entailment, neutral, contradiction) between the premise and hypothesis. For instance, the following pair is assigned the label entailment to indicate that the premise entails the hypothesis.
Premise: A soccer game with multiple males playing. Hypothesis: Some men are playing a sport.

Model Settings. We embed the image and text inputs using a convolutional neural network (CNN). To generate adversarial images, we design the decoder as (i.) a transpose CNN, and (ii.) a pre-trained GAN pmlr-v70-odena17a . For adversarial text generation, we consider the following designs for the decoder: (i.) a transpose CNN, (ii.) a language model, and (iii.) a pre-trained ARAE pmlr-v80-zhao18b model. The recognition networks and are fully connected neural networks of similar architecture. To evaluate our approach, we target a range of adversarial models. For more details on the architectures of these models, feature extraction, the recognition networks, and the decoder, we refer the reader to Appendix B.

6.1 Generating Adversarial Images

Setup. As argued in DBLP:journals/corr/abs-1802-00420 , the strongest non-certified defense available against adversarial attacks is adversarial training with Projected Gradient Descent (PGD) DBLP:journals/corr/MadryMSTV17 . Thus, we evaluate the strength of our MNIST, CelebA and SVHN adversarial examples against adversarially trained ResNet DBLP:journals/corr/HeZRS15 models with a 40-step PGD and noise margin less than 0.3. The ResNet models follow the architecture design of  DBLP:conf/nips/SongSKE18 . For MNIST, we also target the certified defenses DBLP:journals/corr/abs-1801-09344 and DBLP:journals/corr/abs-1711-00851 with set to 0.1.

We formalize in eq. (9) respectively as two -norm losses and a discriminative loss. For the p-norm case, we first reconstruct using the true inputs as targets, i.e., where and are samples we generate. For the latter, we use a pre-trained Wasserstein formulation of AC-GAN pmlr-v70-odena17a to generate samples and set . Finally, we use the discriminator of the same GAN to discriminate between and . We experiment with all three variants and report only the results for where the adversarial success rates and sample quality are higher.

Adversarial Success Rate (ASR). Against the non-certified defenses, we achieve an ASR of 97.2% for MNIST, 87.6% for SVHN, and 84.4% for CelebA. The certified defenses on MNIST guarantee that no attack with less than 0.1 can have a success rate larger than 35% and 5.8% respectively DBLP:journals/corr/abs-1801-09344 ; DBLP:journals/corr/abs-1711-00851 . However, our ASR against these defenses is 95.2% and 96.8%. For all the datasets, the latent noise levels are smaller than , and the accuracies of the target models are higher than 96.3%; thus proving the effectiveness of our method in generating strong adversarial examples.
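For reference, an adversarial success rate can be computed as sketched below. The convention used here (counting only inputs the classifier gets right on clean data, and treating any misclassification of the adversarial version as a success) is an assumption; the text does not spell out its exact definition, and the labels are made-up illustrative values.

```python
def adversarial_success_rate(true_labels, clean_preds, adv_preds):
    """Fraction of correctly classified inputs whose adversarial version is misclassified."""
    attacked = [(y, a) for y, c, a in zip(true_labels, clean_preds, adv_preds) if c == y]
    if not attacked:
        return 0.0
    return sum(1 for y, a in attacked if a != y) / len(attacked)

# Illustrative labels only; not results from the paper.
true_labels = [0, 1, 1, 0, 1]
clean_preds = [0, 1, 0, 0, 1]   # the third input is already misclassified, so it is skipped
adv_preds = [1, 1, 1, 1, 0]
asr = adversarial_success_rate(true_labels, clean_preds, adv_preds)
print(asr)
```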

Noise Level. We compute the latent strength of the adversarial examples by evaluating the average spectral norm of the learned perturbations over a few minibatch samples. For MNIST, CelebA, and SVHN, the latent nuisances are respectively , and , all smaller than , and significantly lower than in DBLP:conf/nips/SongSKE18 (note that our results are not directly comparable with Song et al. DBLP:conf/nips/SongSKE18 , as their reported success rates are for targeted unrestricted adversarial examples manually computed from Amazon MTurkers' votes, unlike ours). Beyond the imperceptibility of our adversarial examples, these noise levels show that the distributions the particles and the model instances follow are similar. This is further illustrated in Figure 7, with the KDE plots, by the overlapping marginal distributions of the clean and perturbed latent codes generated respectively from using and , thus resulting in good sample quality as exemplified in Figure 14.
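The noise-level measurement described above (an average spectral norm over a set of perturbations) can be sketched with a power iteration. The matrices below are hypothetical stand-ins for learned perturbations, and the iteration count is an arbitrary choice.

```python
import math

def spectral_norm(matrix, iterations=200):
    """Largest singular value via power iteration on A^T A.
    Assumes the uniform start vector is not orthogonal to the top right-singular vector."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        atav = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in atav]
    av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in av))

def average_spectral_norm(perturbations):
    """Mean spectral norm over a collection of perturbation matrices."""
    return sum(spectral_norm(p) for p in perturbations) / len(perturbations)

# Hypothetical perturbations with spectral norms 4 and 2.
perturbations = [[[3.0, 0.0], [0.0, 4.0]], [[0.0, 2.0], [0.0, 0.0]]]
avg = average_spectral_norm(perturbations)
print(round(avg, 6))
```

In practice a numerical library routine for the matrix 2-norm would replace the hand-written power iteration; it is written out here only to keep the sketch dependency-free.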

Figure 7: Marginal distributions of clean (blue) and perturbed (red) latent codes on few minibatches. Surviving panel labels: (b) CelebA, (c) SVHN, (d) SNLI.


Figure 14: True inputs - Adversaries (red boxes) for (a)-(b) MNIST, (c)-(d) CelebA, (e)-(f) SVHN. See Appendix A for higher resolution images and the pilot study to manually evaluate sample quality.
True Input 1 P: A biker races. H: A person is riding a bike. Label: Entailment
Adversary 1 H: A man races. Label: Contradiction
True Input 2 P: The girls walk down the street. H: Girls walk down the street. Label: Entailment
Adversary 2 H: A choir walks down the street. Label: Neutral
True Input 3 P: Two dogs playing fetch. H: Two puppies play with a red ball. Label: Neutral
Adversary 3 H: Two people play in the snow. Label: Contradiction
Table 1: Test samples and adversarial hypotheses: (P) for premise, (H) for Hypothesis.

6.2 Generating Adversarial Text

Setup. We perturb the hypothesis sentences to attack our SNLI classifier while keeping the premise sentences unchanged. Similar to zhao2018generating , we use ARAE pmlr-v80-zhao18b for word embedding, and a CNN for sentence embedding. To generate adversarial sentences from the perturbed latent codes, we experiment with three decoders: (i.) a transpose CNN, (ii.) a language model, and (iii.) the decoder of a pre-trained ARAE pmlr-v80-zhao18b model. In all three cases, we use the cross-entropy loss as reconstruction loss (see eq. 9) and condition the generation of the adversarial hypotheses on the premises of the sentence pairs. We detail the configuration design of each decoder in Appendix B. We generate adversarial text at word level using a vocabulary of only 11,000 words, similar to zhao2018generating .

Adversarial Success Rate (ASR). Using the transpose CNN, we achieve an ASR of 67.28% against the SNLI classifier, which has an accuracy of 89.42%. Table 1 shows some examples of the generated adversarial hypotheses. The adversarial examples are legible and informative, and generally convey the meaning of the sentences they perturb. Sometimes, however, we notice slight changes in the meaning of the true hypotheses. We do not report results for the other two decoder designs, as the adversarial examples they generate are mostly illegible. We posit that this is due to the compounding effect of the perturbations on the language model and to a distribution shift with ARAE. We discuss these limitations and provide more adversarial text samples in Appendix A.
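For concreteness, the sketch below shows one common way to compute an adversarial success rate from classifier predictions. The definition used here is an assumption, since this section does not spell it out: the ASR is taken to be the fraction of correctly classified clean examples whose adversarial counterparts are misclassified. The function name and the toy labels are hypothetical.

```python
import numpy as np

def adversarial_success_rate(y_true, y_clean_pred, y_adv_pred):
    """Fraction of correctly classified inputs whose adversaries fool the classifier."""
    y_true = np.asarray(y_true)
    y_clean_pred = np.asarray(y_clean_pred)
    y_adv_pred = np.asarray(y_adv_pred)
    # Only inputs the classifier gets right on clean data can be "attacked".
    correct = y_clean_pred == y_true
    if not correct.any():
        return 0.0
    fooled = correct & (y_adv_pred != y_true)
    return float(fooled.sum() / correct.sum())

# Toy SNLI-style labels: 0 = entailment, 1 = neutral, 2 = contradiction.
y_true = [0, 1, 2, 1]
y_clean_pred = [0, 1, 2, 0]   # the last input is already misclassified
y_adv_pred = [1, 1, 0, 0]     # adversaries flip the first and third predictions
print(adversarial_success_rate(y_true, y_clean_pred, y_adv_pred))
```

Other conventions exist (for example, dividing by all test examples), so reported rates are only comparable when the same denominator is used.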

7 Conclusion

Many adversarial learning approaches fail to enforce the semantic relatedness that ought to exist between true inputs and their adversarial counterparts. Motivated by this fact, we developed an approach that ensures the true inputs and their adversarial examples exhibit similar semantics by restricting the search for adversarial examples to the manifold of the true inputs. Our success rates against certified and non-certified defenses known to be resilient to traditional adversarial attacks also illustrate the effectiveness of our approach in generating strong adversarial examples.



Appendix A: Discussion & Additional Results

Discussion: Here, we discuss some of the choices pertaining to the design of our approach, and our pilot study to manually evaluate adversarial examples we generate. We also discuss the limitations of our method and argue why a fair side-by-side comparison with [2, 3] is impossible. Finally, via empirical validation using toy data, we show that our adversarial examples are manifold-preserving.

Space/Time Complexity. As noted in [19], the Gaussian prior assumption in VAEs might be too restrictive to generate sufficiently meaningful latent codes [20]. To relax this assumption and produce informative and diverse latent codes, we used SVGD. To generate manifold-preserving adversarial examples, we proposed GBSM. Both SVGD and GBSM maintain a set of model instances. As ensemble methods, both inherit the advantages of ensemble models but also their shortcomings, most notably in space/time complexity. Thus, instead of maintaining model instances, we maintain only and from which we sample these model instances. We experimented with set to and . As increases, we notice some increase in sample quality at the expense of longer runtimes. One way to alleviate the computational overhead during training is to enforce weight-sharing in the lower layers of both the particles and the model instances . The overhead incurred as takes on larger values shrinks drastically during inference, however, since we only need to sample the model instances to generate adversarial examples.
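To make the cost of maintaining a set of particles concrete, here is a minimal sketch of a generic SVGD update. It is a sketch under stated assumptions rather than the paper's implementation: it uses an RBF kernel with a median-heuristic bandwidth, a standard Gaussian target, and hypothetical function names. Each step builds an n-by-n kernel matrix, which is where the quadratic cost in the number of particles comes from.

```python
import numpy as np

def svgd_step(x, grad_logp, step=0.1):
    """One SVGD update for particles x of shape (n, d)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]        # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff**2, axis=-1)          # (n, n)
    # Median heuristic for the RBF bandwidth (an assumed, common choice).
    h = np.median(sq_dist) / np.log(n + 1) + 1e-8
    K = np.exp(-sq_dist / h)                    # K[j, i] = k(x_j, x_i)
    # Gradient of k(x_j, x_i) with respect to x_j; this is the repulsive term.
    grad_K = -2.0 / h * diff * K[:, :, None]    # (n, n, d)
    # phi[i] = mean_j [ k(x_j, x_i) * grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ grad_logp(x) + grad_K.sum(axis=0)) / n
    return x + step * phi

# Toy target: standard Gaussian, for which grad log p(x) = -x.
def grad_logp(x):
    return -x

rng = np.random.default_rng(0)
x0 = rng.normal(loc=5.0, size=(50, 2))          # particles start far from the target
x = x0.copy()
for _ in range(200):
    x = svgd_step(x, grad_logp)
print("initial mean:", x0.mean(axis=0), "final mean:", x.mean(axis=0))
```

The kernel-weighted gradient term pulls particles toward high-density regions while the repulsive term keeps them diverse, which is the property the paragraph above relies on.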

Preserving Textual Meaning. To generate text adversaries, we experimented with three architecture designs for the decoder : (i) a transpose CNN, (ii) a language model, and (iii) the decoder of a pre-trained ARAE [29] model. The transpose CNN generates more legible text than the other two designs, although we sometimes notice changes in meaning in the generated adversarial examples. Adversarial text generation is challenging in that small perturbations in the latent codes can go unnoticed at generation, whereas high noise levels can render the outputs nonsensical or uninformative. To produce text adversaries that faithfully preserve the meaning of their clean counterparts, we need powerful sentence generators (e.g. BERT [39] or GPT [40]) trained on large corpora. Training such large language models, however, requires time and computational resources. Furthermore, in our experiments, we only considered a vocabulary of 11,000 words and sentences of at most 10 words, to align our evaluation with the experimental choices of [3].

Measuring Perceptual Quality. Measuring perceptual quality is desirable when the method used to generate adversarial examples relies on GANs or VAEs, both of which are known to often produce samples of limited quality. From the results of [2, 3] that we illustrate in Figure 24, we can see that some of their adversarial images are not legitimate, owing to the mode collapse that GANs suffer from. VAEs also have their own shortcomings, in that the Gaussian prior might be too restrictive [19] to produce sufficiently meaningful latent codes [20]. By using SVGD and GBSM, however, we alleviate this risk since we do not optimize the ELBO explicitly. Moreover, the sampling of clean and perturbed latent codes is conditioned on the true inputs, which makes the codes informative about the inputs. To effectively measure the quality of our adversarial examples, it therefore seems relevant to rely on human evaluation. Thus, for validation purposes, we carry out a pilot study, which we detail below. However, a fair side-by-side comparison of our results with [2, 3], either manually or using metrics like mutual information or Fréchet inception distance [41], seems impossible, as [2] and [3] either perform unrestricted targeted attacks (their adversarial examples might totally differ from the true inputs) or do not target any defenses.

Human Evaluation: We carry out a pilot study to evaluate the coherence of the generated adversarial examples by asking three yes-or-no questions: (Q1) are the adversarial examples semantically sound?, (Q2) are the true inputs similar perceptually or in meaning to their adversarial counterparts? and (Q3) are there any interpretable visual cues in the adversarial images that support their misclassification? For all the datasets, we randomly select a number of samples, generate their adversarial examples and record their classes, and present a questionnaire to 10 human subjects for evaluation.

For MNIST, we pick 50 images (5 for each digit) and generate adversarial examples against a 40-step PGD ResNet model with and the certified defenses [6] and [8] with . We do the same for SVHN, except that we only attack the 40-step PGD ResNet model. For CelebA, we also randomly pick 50 images (25 for each gender: male or female) and generate adversarial examples against the 40-step PGD ResNet model. We carry out a similar pilot study for the SNLI dataset, considering only adversarial examples generated when the decoder is a transpose CNN. We randomly select 20 pairs of sentences (premise and hypothesis) and generate adversarial hypotheses for each sentence pair, keeping the premise sentence unchanged. Next, we present the results of this pilot study along with samples of the adversarial examples we generate, and qualitatively compare our samples with those of [2, 3].

Adversarial Images: CelebA

Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the CelebA dataset. We also provide the results of the pilot study.

True inputs Adversarial Examples
Table 2: CelebA samples and their adversarial examples. Only the images in red boxes are adversarial; i.e., the gender gets changed to female if the corresponding true images were male and vice-versa.
Questionnaire Against 40-step PGD - Adversarial CelebA Images (%)

Question 1 (Q1): Yes
Question 2 (Q2): Yes 100
Question 3 (Q3): No 100
Table 3: Pilot Study

Adversarial Images: SVHN

Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the SVHN dataset. We also provide the results of the pilot study.

True inputs Adversarial Examples
Table 4: SVHN samples and their adversarial examples. Images in red boxes are all adversarial.
Questionnaire Against 40-step PGD - Adversarial SVHN Images (%)

Question 1 (Q1): Yes
Question 2 (Q2): Yes 100
Question 3 (Q3): No 100
Table 5: Pilot Study

Some adversarial images and clean ones were difficult to evaluate due to blurriness reported in both.

Adversarial Images: MNIST

Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the MNIST dataset. We also provide the results of the pilot study.

True inputs Adversarial Examples
Table 6: MNIST samples and their adversarial examples. Images in red boxes are all adversarial.
Questionnaire Against 40-step PGD (%) Against Raghunathan et al. [6] (%) Against Wong et al. [8] (%)

Question 1 (Q1): Yes 100 100 100
Question 2 (Q2): Yes 100 100 100
Question 3 (Q3): No 100 100 100
Table 7: Pilot Study

Adversarial Text: SNLI

Here, we provide a few random samples of the non-targeted adversarial examples we generate with our approach on the SNLI dataset. We also provide the results of the pilot study.

True Input 1 P: A biker races. H: A person is riding a bike. Label: Entailment
Adversary 1 H: A man races. Label: Contradiction
True Input 2 P: The girls walk down the street. H: Girls walk down the street. Label: Entailment
Adversary 2 H: A choir walks down the street. Label: Neutral
True Input 3 P: Two wrestlers in an intense match. H: Two wrestlers are competing and are brothers. Label: Neutral
Adversary 3 H: Two women are playing together. Label: Contradiction
True Input 4 P: A group of people celebrate their asian culture. H: A group of people celebrate. Label: Entailment
Adversary 4 H: A group of people cruising in the water. Label: Neutral
True Input 5 P: Cheerleaders standing on a football field. H: Cheerleaders are wearing outside. Label: Entailment
Adversary 5 H: Person standing on a playing field. Label: Neutral
True Input 6 P: People are enjoying food at a crowded restaurant. H: People are eating in this picture. Label: Entailment
Adversary 6 H: People are enjoying looking at a crowded restaurant. Label: Contradiction
True Input 7 P: A wrestler celebrating his victory. H: A wrestler won a championship. Label: Entailment
Adversary 7 H: A rider won the championship. Label: Contradiction
Table 8: Examples of adversarially generated hypotheses with the true premises kept unchanged.
Questionnaire Adversarial Hypotheses (SNLI)

Question 1 (Q1): Yes 83 %
Question 2 (Q2): Yes 75 %
Table 9: Pilot Study

Manifold Preservation: Illustration with Synthetic Data

To validate empirically that the adversarial examples we generate reside in the manifold of their original counterparts, we evaluate our approach on the 3D non-linear Swiss Roll dataset, which comprises 1600 datapoints grouped in 4 classes. Figure 19.a shows 2D plots of the manifold learned by our approach before (left) and after (right) perturbing its elements. The points in orange are the latent codes (among all the points in blue) whose perturbed versions (in green) induced adversariality. In Figure 19.a, we can see that the topological structure of the manifold is well preserved, with the exception of a few perturbed points lying slightly away from the boundary delineated by the blue points. As a result, we can see in Figure 19.c and Figure 19.d how close the adversarial examples are to their clean counterparts.

Figure 19: (a) Swissroll manifold before and after perturbing its elements, (b) 3D view of the Swissroll dataset, (c) Few instances of class 3 (green) misclassified as class 4 (purple), (d) Few instances of class 3 (blue) misclassified as class 2 (purple). The yellow color in (c) and (d) represents the datapoints not subjected to adversariality.
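For readers who want to reproduce a comparable toy setup, the sketch below generates a 3D Swiss Roll with 1600 points and 4 classes. The construction is an assumption: it follows the common parametrization (similar to scikit-learn's `make_swiss_roll`), and the classes are formed by splitting the roll parameter into quartiles, which may differ from how the classes were defined in the experiment above. The function name is hypothetical.

```python
import numpy as np

def make_swiss_roll(n_points=1600, n_classes=4, noise=0.0, seed=0):
    """Sample a 3-D Swiss Roll and label points by position along the roll."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=n_points))   # roll parameter
    height = 21.0 * rng.uniform(size=n_points)
    X = np.stack([t * np.cos(t), height, t * np.sin(t)], axis=1)
    X = X + noise * rng.normal(size=X.shape)
    # Assumed labeling: equal-sized classes from quantiles of the roll parameter.
    edges = np.quantile(t, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    labels = np.digitize(t, edges)
    return X, labels, t

X, labels, t = make_swiss_roll()
print(X.shape, np.bincount(labels))
```

Because the labels depend only on the roll parameter, a perturbation that moves a point along the roll can change its class while keeping it on the manifold, which is the behavior Figure 19 is meant to illustrate.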

Adversarial Sampling Quality: High-Level Comparison with Song et al. [2] & Zhao et al. [3]

As argued in the Discussion section, a fair side-by-side comparison of our results with those of [2, 3] is not possible, for the reasons discussed earlier. For illustration purposes only, and to showcase the differences in sampling quality between our approach and those of [2] and [3], we show below some of the adversarial examples they generate. Unlike ours, some of their images are not valid (see the N/A marks). Also in contrast to ours, a few of their adversarial examples, although semantically sound, fail to preserve the structure of the true inputs; for instance, in Figure 24.b row 1, we can notice the changes in the structure of source class 1. The same holds for classes 7, 8, and 0 in Figure 24.d row 4.

Figure 24: (a)-(c) From Song et al. [2]’s results, (d) From Zhao et al. [3]’s results.

Appendix B: Detailed Experimental Settings

Layer: Configuration [Replicate Block]
Initial Layer: conv. maps. stride. [_]
Residual Block 1: batch normalization, leaky relu; conv. maps stride; batch normalization, leaky relu; conv. maps. stride; residual addition
Residual Block 1: batch normalization, leaky relu; conv. maps stride; batch normalization, leaky relu; conv. maps. stride; average pooling, padding
Residual Block 2: batch normalization, leaky relu; conv. maps stride; batch normalization, leaky relu; conv. maps. stride; average pooling, padding
Residual Block 2: batch normalization, leaky relu; conv. maps stride; batch normalization, leaky relu; conv. maps. stride; average pooling, padding [_]
Residual Block 3: batch normalization, leaky relu; conv. maps stride; batch normalization, leaky relu; conv. maps. stride; average pooling, padding
Pooling Layer: conv. maps. stride. [_]
Output Layer: conv. maps. stride. [_]
Table 10: ResNet Classifier.
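The residual blocks in Table 10 follow a pre-activation pattern: batch normalization, leaky ReLU, convolution, repeated twice, followed by a residual addition. The sketch below illustrates that pattern in NumPy. Kernel sizes, numbers of feature maps, and strides are not recoverable from the table, so the 3x3 kernels, stride 1, leaky-ReLU slope, and absence of learned batch-norm scale and shift are all assumptions, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def batch_norm(x, eps=1e-5):
    """Normalize each channel over the batch and spatial axes (no learned scale/shift)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def conv2d_same(x, w):
    """Stride-1 cross-correlation with zero padding.

    x: (N, C_in, H, W); w: (C_out, C_in, k, k) with odd k.
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    windows = sliding_window_view(xp, (k, k), axis=(2, 3))   # (N, C_in, H, W, k, k)
    return np.einsum("nchwij,ocij->nohw", windows, w)

def residual_block(x, w1, w2):
    """Pre-activation block: (BN, leaky ReLU, conv) twice, then residual addition."""
    h = conv2d_same(leaky_relu(batch_norm(x)), w1)
    h = conv2d_same(leaky_relu(batch_norm(h)), w2)
    return x + h

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8))                 # hypothetical batch of feature maps
w1 = 0.1 * rng.normal(size=(4, 4, 3, 3))
w2 = 0.1 * rng.normal(size=(4, 4, 3, 3))
print(residual_block(x, w1, w2).shape)
```

Keeping the number of channels and the spatial size unchanged inside the block is what makes the plain addition `x + h` valid; blocks that change resolution need pooling or padding on the skip path, as the table's "average pooling, padding" entries suggest.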

Name: Configuration
Recognition Networks:
  Input Dim: 50, Hidden Layers: [60, 70], Output Dim: Num weights & biases in
  Input Dim: 50, Hidden Layers: [60, 70], Output Dim: Num weights & biases in
Model Instances:
  Particles: Input Dim: (MNIST), (CelebA), (SVHN), 300 (SNLI); Hidden Layers: [40, 40]; Output Dim (latent code): 100
  Parameters: Input Dim: (MNIST), (CelebA), (SVHN), 100 (SNLI); Hidden Layers: [40, 40]; Output Dim (latent code): 100
Feature Extractor: Input Dim: (MNIST), (CelebA), (SVHN), (SNLI); Hidden Layers: [40, 40]; Output Dim: (MNIST), (CelebA), (SVHN), (SNLI)
Transpose CNN: For CelebA & SVHN: [filters: 64, stride: 2, kernel: 5] × 3; For SNLI: [filters: 64, stride: 1, kernel: 5] × 3
Language Model: Vocabulary Size: 11,000 words; Max Sentence Length: 10 words
SNLI Classifier: Input Dim: 200, Hidden Layers: [100, 100, 100], Output Dim: 3
Learning Rates: Adam Optimizer
More Settings: Batch size: 64, Inner-updates: 3, Training epochs: 500
Table 11: Model Configurations + SNLI Classifier + Hyper-parameters.
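The recognition-network rows of Table 11 describe networks whose output dimension equals the number of weights and biases of the network they parameterize. The sketch below illustrates this idea for the SNLI particle configuration (input 300, hidden layers [40, 40], latent code 100). It is only a sketch under assumptions: the tanh activations, the Gaussian noise input, the random initialization, and all function names are hypothetical and not taken from the paper.

```python
import numpy as np

def mlp_param_count(sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def init_mlp(sizes, rng):
    return [(rng.normal(scale=1.0 / np.sqrt(i), size=(i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    for layer, (W, b) in enumerate(params):
        x = x @ W + b
        if layer < len(params) - 1:
            x = np.tanh(x)               # assumed hidden activation
    return x

def unflatten(flat, sizes):
    """Reshape a flat parameter vector into per-layer (W, b) pairs."""
    params, pos = [], 0
    for i, o in zip(sizes[:-1], sizes[1:]):
        W = flat[pos:pos + i * o].reshape(i, o)
        pos += i * o
        b = flat[pos:pos + o]
        pos += o
        params.append((W, b))
    return params

encoder_sizes = [300, 40, 40, 100]                  # SNLI particle row of Table 11
n_weights = mlp_param_count(encoder_sizes)
recognition_sizes = [50, 60, 70, n_weights]         # recognition-network row of Table 11

rng = np.random.default_rng(0)
recognition = init_mlp(recognition_sizes, rng)
noise = rng.normal(size=50)                         # assumed input to the recognition network
flat_weights = mlp_forward(recognition, noise)      # one sampled model instance, flattened
encoder = unflatten(flat_weights, encoder_sizes)
z = mlp_forward(encoder, rng.normal(size=(64, 300)))  # latent codes for a batch of 64
print(n_weights, z.shape)
```

Drawing a fresh noise vector yields a different model instance, so only the recognition network has to be stored, which matches the space-saving argument made in Appendix A.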