Learning Diverse Latent Representations for Improving the Resilience to Adversarial Attacks

06/26/2020 · Ali Mirzaeian et al. · George Mason University; University of California, Davis; AUT; University of Maryland, Baltimore County

This paper proposes an ensemble learning model that is resistant to adversarial learning attacks. To build resilience, we propose a training process in which each member learns a radically different latent space. Member models are added to the ensemble one at a time. Each model is trained on the dataset to improve accuracy, while its loss function is regularized by a reverse knowledge-distillation term that forces the new member to learn new features and map them to a latent space safely distanced from those of the existing members. We evaluate the reliability and performance of the proposed solution on image classification tasks using the CIFAR10 and MNIST datasets and show improved performance compared to state-of-the-art defense methods.




1. Introduction

In the past decade, research on neuromorphic-inspired computing models and on the application of Deep Neural Networks (DNNs) to the estimation of hard-to-compute functions and the learning of hard-to-program tasks has grown significantly, and the accuracy of these models has considerably improved. Research on learning models first focused on improving model accuracy (Krizhevsky et al., 2012; He et al., 2016), but as the models matured, researchers explored other dimensions, such as the energy efficiency of the models (Neshatpour et al., 2018, 2019; Sayadi et al., 2017) and of the underlying hardware (Mirzaeian et al., 2020b, a; Chen et al., 2018). The wide adoption of these capable solutions then raised concerns over their security.

Among the many security aspects of learning solutions is their vulnerability to adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2014). Researchers have illustrated that imperceptible yet targeted adversarial changes to the input (i.e., image, audio, or video) of neural networks can dramatically degrade their performance (Kurakin et al., 2016; Papernot et al., 2016b, a).

The vulnerability of DNNs to adversarial attacks has raised serious concerns about using these models in critical applications, in which an adversary can slightly perturb the input to fool the model (Elsayed et al., 2018; Yang et al., 2018). In this paper, we focus on adversarial attacks on image classification models, where an adversary manipulates an input image to force the DNN to misclassify it.

Non-robust features are features that are strongly associated with a certain class yet have small variation across classes (Garg et al., 2018; Behnia et al., 2020). Ilyas et al. (Ilyas et al., 2019) showed that the high sensitivity of the underlying model to the non-robust features present in the input dataset is a major reason for the model's vulnerability to adversarial examples. An adversary therefore crafts a perturbation that accentuates the non-robust features to achieve a successful adversarial attack.

From this discussion, one means of building robust classifiers is to identify robust features and train a model using only those features (which have a large inter-class variation), making it harder for an adversary to mislead the classifier (Ilyas et al., 2019). Motivated by this discussion, we propose a simple yet effective method for improving the resilience of DNNs by introducing auxiliary model(s) trained in the spirit of knowledge distillation while forcing diversity across the features formed in their latent spaces.

We argue that the reason adversarial attacks are transferable across models is that the models learn similar latent spaces for non-robust features. This is the result of either 1) using the same training set or 2) using knowledge distillation while solely focusing on improving classification accuracy. In other words, by sharing the dataset or the knowledge of a trained network (on a dataset), the potential vulnerabilities of the models coincide. Hence, an attack that works on one model is very likely to work on the other(s). This conclusion is also supported by the observations of Ilyas et al. (Ilyas et al., 2019). From this argument, we propose to augment the task of knowledge distillation with an additional and explicit requirement: the features learned by the student (i.e., auxiliary) model(s) should be distinct and independent from those of the teacher (i.e., main) model. For this reason, as illustrated in Fig. 1, we introduce the concept of Latent Space Separation (LSS), forcing the auxiliary model to learn features with little or no correlation to the teacher's. Hence, an adversarial attack on the main model will have minimal impact on the latent representation of the features learned by the auxiliary model(s).

Figure 1. In our proposed solution, we train auxiliary model(s) that track the main model's classification while learning a diverse set of features whose latent representations map to a space far apart from the teacher's. Blue dots show the latent space of the main model, and red dots show the latent space of the auxiliary model.

2. Prior Work

Prior research on adversarial learning has produced different explanations of why learning models can be easily fooled by adversarial input perturbations. Early research blamed the non-linearity of neural networks for their vulnerability (Goodfellow et al., 2014; Biggio et al., 2013). However, this perception was later challenged by Goodfellow et al. (Goodfellow et al., 2014), who developed the Fast Gradient Sign Method (FGSM), explaining how neural-network linearity can be exploited to rapidly build adversarial examples.
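As a concrete illustration (a minimal sketch, not the cited authors' code; the helper name `fgsm_perturb` is ours), FGSM can be written in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x loss(model(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each input coordinate by eps in the direction that increases the loss.
    return (x + eps * x.grad.sign()).detach()
```

The single gradient step is exactly the linearity argument above: for a (locally) linear model, moving every coordinate by eps along the loss gradient's sign maximizes the first-order loss increase under an L-infinity budget.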

Building robust learning models that can resist adversarial examples has been a topic of interest for many researchers. Some of the most notable prior art on this topic includes 1) adversarial training (Shaham et al., 2018; Amberkar et al., 2018), 2) knowledge distillation (KD) (Hinton et al., 2015), and 3) denoising and refinement of adversarial examples (Meng and Chen, 2017).

Adversarial Training: This is the process of incrementally training a model with known adversarial examples to improve its resilience. The problem with this approach is that the model's resilience improves only against similarly generated adversarial examples (Wang et al., 2019; Tramèr and Boneh, 2019).

Knowledge Distillation (KD): In this method, a compact (student) model learns to follow the behavior of one or more teacher models. KD was originally introduced as a means of building compact models (students) from more accurate yet larger models (teachers), but it was later also used to diminish the sensitivity of the student model's output to input perturbations (Papernot et al., 2015). However, the work in (Carlini and Wagner, 2016) illustrated that if the attacker has access to the student model, with minor changes the student model can be as vulnerable as the teacher. Specifically, knowledge distillation can be categorized as a gradient-masking defense (Athalye et al., 2018), in which the magnitude of the underlying model's gradients is reduced to minimize the effect of changes in the model's input on its output. Although gradient-masking defenses can be effective against white-box attacks, they are not resistant to black-box evasion attacks (Carlini and Wagner, 2016). Our proposed solution is motivated by KD; however, we do not force the auxiliary (i.e., student) network(s) to follow the output layers of the main network (i.e., teacher). In contrast to KD, the auxiliary network has to learn a different latent space while being trained for the same task and on the same dataset.

Refining the Input Image: Adversarial defenses that rely on refining the input samples try to denoise the input image using some sort of autoencoder (variational, denoising, etc.) (Chen and Sirkeci-Mergen, 2018). In this approach, the image is first encoded by a deep network into a latent code (a lossy, compressed, yet reconstructable representation of the input image), and the image is then reconstructed by a decoder. The decoded image is then fed to the classifier (Chen and Sirkeci-Mergen, 2018). However, this approach suffers from two main weaknesses: (1) the reconstruction error of the decoder can significantly reduce the classifier's accuracy, and this error grows as the number of input classes increases; (2) the encoder network is itself vulnerable to adversarial attacks, which means new adversarial examples can be crafted against the combined model that includes the encoder network.
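The encode-decode-classify pipeline described above can be sketched as follows (a hedged illustration; the class name, layer sizes, and `refine_then_classify` helper are our assumptions, not the cited work's code):

```python
import torch
import torch.nn as nn

class Refiner(nn.Module):
    """Undercomplete autoencoder: compress the input to a small latent code,
    then reconstruct it, ideally projecting adversarial noise away."""
    def __init__(self, dim=784, code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, code), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code, dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def refine_then_classify(refiner, classifier, x):
    # The classifier only ever sees the reconstructed (refined) image.
    return classifier(refiner(x))
```

Both weaknesses above are visible in this sketch: the classifier's input is only as good as the decoder's reconstruction, and the `Refiner` itself is a differentiable network an attacker can fold into the attack graph.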

3. Proposed Method

Our objective is to formulate a knowledge-distillation process in which one or more auxiliary (student) models are trained to closely follow the prediction of a main (teacher) model while being forced to learn substantially different latent spaces. For example, in Fig. 2, assume three auxiliary models have been trained alongside the main model so as to have maximum diversity between their latent-space representations. Our desired outcome is that an adversarial perturbation that moves the latent representation of an input sample out of its corresponding class boundary in the main model has a negligible or small impact on the corresponding latent representations in the class boundaries of the auxiliary models. Hence, an adversarial input that fools the main model becomes less effective or ineffective on the auxiliary models. This objective is reached through the way the loss function of each model is defined. The details of the main and auxiliary network(s), the learning procedure, and the objective function are explained next:

Figure 2. The latent space of the main model and the latent spaces of auxiliary models 1, 2, and 3. The red arrow at the main model shows the direction of the adversarial perturbation applied to an input sample; the red arrows for the auxiliary models show the projection of that perturbation onto the latent spaces of the auxiliary models.

Main Model (F): In this paper, we evaluate our proposed method on two datasets, MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009). Depending on the underlying dataset, the structure of the main (teacher) model F is selected as shown in Table 1. We employ the cross-entropy loss L_CE, see Eq. 1, as the objective function for training the model F:

L_CE(x, y) = − Σ_c y_c · log(F(x)_c)    (1)

In this equation, x and y are a training sample and its (one-hot) label, respectively, and F(x)_c is the probability the main model assigns to class c.

Auxiliary Model (A): Each auxiliary model A is a structural replica of the main model F. However, A is trained using a modified KD training process: let us denote the outputs of layer l of models A and F (i.e., their latent spaces) by z_A^l and z_F^l, respectively. Our training objective is to force model A to learn a very different latent space compared to F while both perform the same classification task on the same dataset. To achieve this, the term K_l(A, F), which measures the similarity of layer l of model A to layer l of model F, is defined as follows:

K_l(A, F) = Σ_{x∈D} ⟨ z_F^l(x), z_A^l(x) ⟩    (2)

In this objective function, D is the dataset and ⟨·,·⟩ is the inner product. This similarity measure is then factored into the loss function for training model A, which increases the dissimilarity of layer l of model A with respect to layer l of model F:

L_A = L_CE + λ · K_l(A, F)    (3)

In this equation, λ is a regularization parameter that controls the contribution of each term.
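A minimal PyTorch sketch of such a loss follows. We use cosine similarity between the flattened layer-l activations (Fig. 3 names cosine similarity as the DKD similarity measure); the function name and signature are our own assumptions:

```python
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, target, z_student, z_teacher, lam):
    """Cross-entropy plus a lambda-weighted penalty on latent similarity.

    Minimizing the cosine similarity between the student's and the (frozen)
    teacher's layer-l activations pushes the two latent spaces apart,
    while the cross-entropy term preserves classification accuracy.
    """
    ce = F.cross_entropy(student_logits, target)
    sim = F.cosine_similarity(z_student.flatten(1),
                              z_teacher.detach().flatten(1), dim=1).mean()
    return ce + lam * sim
```

Setting `lam = 0` recovers plain cross-entropy training; larger values trade accuracy for latent-space diversity, which is exactly the tradeoff explored in Fig. 5.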

Let us assume an adversarial perturbation that, when added to the input, forces the main model to misclassify it. For this misclassification (evasion) to happen, in some layer close to the output the added noise must push some of the class-identifying features outside the related class boundary learned by the main model. However, the class boundaries learned by the main and auxiliary models are quite different. Therefore, as illustrated in Fig. 2, although the noise can move a feature out of its learned class boundary in the main model, it has very limited power to displace the features learned by the auxiliary model outside their class boundaries at the same layer. In other words, although the similarity term between the main and auxiliary models has a low value, the similarity between the auxiliary model's latent representations before and after adding the perturbation has a high value; consequently, the auxiliary (student) model has low sensitivity to the perturbation, and its prediction remains unchanged.

3.1. Black-Box and White-Box Defense

The auxiliary models can be used to defend against both white-box and black-box attacks; a description and explanation of each is given next:

Black-box Defense: In a black-box attack, the attacker has access to the main model, can apply her desired input to it, and can monitor the model's prediction in order to design an attack and add the adversarial perturbation to the input. Because the attacker has no access to the auxiliary model, and because the auxiliary model has a very different feature space, the auxiliary model remains resistant to black-box attacks, and using a single auxiliary model is sufficient.

White-box Defense: In a white-box attack, the attacker knows everything about the main and auxiliary models, including their parameters and weights, the full details of each model's architecture, and the dataset used for training. For this reason, using a single auxiliary model is not enough, as that model could itself be used for designing the attack. However, we can make the attack significantly more difficult (and also improve classification confidence) by training and using multiple robust auxiliary models, each of which learns different features compared with all the other auxiliary models. Then, to resist the white-box attack, we create a majority-voting system from the robust auxiliary models.

Let us assume we want to train N auxiliary models A_1, ..., A_N, each having a diverse latent space. To learn these networks, first, based on Eq. 3, A_1 is trained to be diverse from the main model F. Then, A_2 is trained to be diverse from both F and A_1. This process continues one model at a time, reaching the N-th model, whose latent space is diverse from all previous models. According to this discussion, for learning the i-th auxiliary model, the loss function is defined as:

L_{A_i} = L_CE + λ · ( K_l(A_i, F) + Σ_{j=1}^{i−1} K_l(A_i, A_j) )    (4)
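The one-by-one training schedule above can be sketched as follows (our own sketch, with `latent_of` a hypothetical hook returning a model's layer-l activation together with its logits; cosine similarity stands in for the similarity term, as in Fig. 3):

```python
import torch
import torch.nn.functional as F

def train_auxiliary(new_model, frozen_models, loader, latent_of, lam, epochs=1):
    """Train `new_model` with cross-entropy plus a penalty on its latent
    similarity to every frozen predecessor (main model + earlier auxiliaries)."""
    for m in frozen_models:                      # predecessors stay fixed
        for p in m.parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam(new_model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, y in loader:
            z_new, logits = latent_of(new_model, x)
            loss = F.cross_entropy(logits, y)
            for m in frozen_models:              # penalize similarity to each predecessor
                z_old, _ = latent_of(m, x)
                sim = F.cosine_similarity(z_new.flatten(1), z_old.flatten(1), dim=1)
                loss = loss + lam * sim.mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_model
```

Freezing the predecessors matches the paper's description: earlier models contribute to the penalty term but their weights are never updated while the new member trains.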
Finally, to increase the confidence of the prediction, instead of a simple majority vote over the top-1 candidates, we consider a boosted defense in which the voting system considers the top-k candidates of each model for those cases where the top-1 majority fails (there is no majority among the top-1 predictions). This gives us two benefits: 1) if a network misclassifies due to adversarial perturbation, there is still a high chance that it assigns a high (but not the highest) probability to the correct class; 2) if a model is confused between the correct class and a closely related but incorrect class and assigns its top-1 confidence to the incorrect class, it still helps identify the correct class in the voting system.
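One plausible realization of this boosted vote in numpy (the exact fallback and tie-breaking rules are our assumptions; the paper only specifies "consider the top-k candidates when top-1 fails"):

```python
import numpy as np

def boosted_vote(probs, k=2):
    """probs: array of shape (n_models, n_classes) of softmax outputs.

    Returns the top-1 majority class when a strict majority exists;
    otherwise falls back to counting how often each class appears among
    the top-k candidates of every model (ties broken by summed probability).
    """
    probs = np.asarray(probs)
    n_models, n_classes = probs.shape
    top1 = probs.argmax(axis=1)
    votes = np.bincount(top1, minlength=n_classes)
    if votes.max() > n_models // 2:               # strict top-1 majority
        return int(votes.argmax())
    topk = np.argsort(probs, axis=1)[:, -k:]      # boosted fallback
    counts = np.bincount(topk.ravel(), minlength=n_classes)
    best = np.flatnonzero(counts == counts.max())
    return int(best[probs[:, best].sum(axis=0).argmax()])
```

For example, three models whose top-1 predictions disagree (a "failed majority") can still agree once each model's second-ranked class is counted.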

4. Experimental Results

In this section, we evaluate the performance of our proposed defense against various white-box and black-box attacks. All models are trained using the PyTorch framework (Paszke et al., 2019), and all attack scenarios are implemented using Foolbox (Rauber et al., 2017), a toolbox for crafting various adversarial examples. The details of the hyperparameters and system configuration are shown in Table 1.


MNIST Architecture                        CIFAR10 Architecture
ReLU Convolutional, 32 filters (3×3)      ReLU Convolutional, 96 filters (3×3)
ReLU Convolutional, 32 filters (3×3)      ReLU Convolutional, 96 filters (3×3)
Max Pooling 2×2                           ReLU Convolutional, 96 filters (3×3)
ReLU Convolutional, 64 filters (3×3)      Max Pooling 2×2
ReLU Convolutional, 64 filters (3×3)      ReLU Convolutional, 192 filters (3×3)
Max Pooling 2×2                           ReLU Convolutional, 192 filters (3×3)
ReLU Convolutional, 200 units             ReLU Convolutional, 192 filters (3×3)
ReLU Convolutional, 200 units             Max Pooling 2×2
Softmax, 10 units                         ReLU Convolutional, 192 filters (3×3)
                                          ReLU Convolutional, 192 filters (1×1)
                                          ReLU Convolutional, 192 filters (1×1)
                                          Global Avg. Pooling
                                          Softmax, 10 units

System configuration and training hyperparameters: OS: Red Hat 7.7, PyTorch: 1.3, Foolbox: 2.3.0, GPU: Nvidia Tesla V100, epochs: 100, MNIST batch size: 64, CIFAR10 batch size: 128, optimizer: Adam, learning rate: 1e-4.

Table 1. The architecture of each of the ensemble models, together with the system configuration and training hyperparameters.
Figure 3. Training method of DKD (left), KD (middle), and RI (right). KL-divergence (Kullback and Leibler, 1951) and cross-entropy (Mannor et al., 2005) are the objective functions used to obtain the total loss for KD. Cosine similarity and cross-entropy are used to obtain the loss function for DKD. For RI, only cross-entropy is used.

4.1. Latent Space Separation

To quantify the diversity of the latent-space representations of an ensemble trained on a dataset D, we first define the Latent Space Separation (LSS) measure between the latent spaces of two models as the margin of the separating hyperplane:

LSS = 2 / ‖w‖    (5)

in which the latent-space representations of the dataset D produced by the two models form the two classes, and w is the normal vector of the hyperplane obtained by a Support Vector Machine (SVM) classifier (Cortes and Vapnik, 1995) trained to linearly separate the latent spaces obtained on the dataset D. More precisely, the LSS between two latent spaces is obtained by following these four steps: 1) train both models on a dataset, e.g., MNIST; 2) generate the latent space of each model on the evaluation set; 3) turn the two latent representations into a two-class classification problem tackled by an SVM classifier; 4) use the SVM margin as the LSS distance between the two latent representations of the dataset. This process is shown in Fig. 4. Note that the SVM classifier should be set in hard-margin mode, meaning no support vector is allowed to pass the margins. When the SVM classification fails, the latent spaces are not linearly separable, i.e., there is either an overlap between the latent spaces or the decision boundary cannot be modeled linearly. When separation succeeds, the larger the marginal distance between the latent spaces, the higher the diversity of the formed latent spaces.
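The four steps can be sketched with scikit-learn as follows. This is our own sketch: we approximate the hard-margin SVM with a very large C, and returning None when any training point violates the margin is our assumed way of detecting "classification failure":

```python
import numpy as np
from sklearn.svm import SVC

def lss(latent_a, latent_f):
    """Latent Space Separation: margin (2 / ||w||) of a near hard-margin
    linear SVM separating two sets of latent codes.

    Returns None when the latent spaces are not linearly separable.
    """
    X = np.vstack([latent_a, latent_f])
    y = np.r_[np.zeros(len(latent_a)), np.ones(len(latent_f))]
    svm = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C approximates hard margin
    # Signed functional margin of every training point.
    signed = svm.decision_function(X) * np.where(y == 1, 1.0, -1.0)
    if (signed < 1 - 1e-2).any():                 # a point violates the hard margin
        return None
    return 2.0 / np.linalg.norm(svm.coef_.ravel())
```

The ensemble-level measure described next is then just the average of this margin computed for each model's latents versus the aggregated latents of the remaining models.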

Figure 4. A hypothetical representation of the latent spaces of two models is shown on top. The bottom figure shows the LSS between these two latent spaces obtained by the SVM.

Using Eq. 5, we define a more general formula for the LSS of an ensemble comprised of N models, see Eq. 6. The total LSS of an ensemble model is obtained by averaging the LSS of each model's latent space versus all the other models'. For instance, imagine an ensemble comprised of three models A_1, A_2, and A_3. Then the total LSS is calculated as the average of LSS(A_1, {A_2, A_3}), LSS(A_2, {A_1, A_3}), and LSS(A_3, {A_1, A_2}). LSS measures the marginal distance of an SVM in a two-class classification task, so LSS(A_i, {A_j}_{j≠i}) indicates that an SVM classification has been performed between the latent space of model A_i and an aggregation of the latent spaces of the other two models.

LSS_total = (1/N) · Σ_{i=1}^{N} LSS(A_i, {A_j}_{j≠i})    (6)
                        MNIST                              CIFAR10
Attack    Param.   KD   KD*  DKD  DKD*  RI   RI*     KD   KD*  DKD   DKD*  RI    RI*
DeepFool  1        4    0    15   0     16   0       576  6    1234  51    891   33
DeepFool  200      7    1    93   0     61   0       628  5    1319  52    959   32
C&W       1        188  3    172  0     225  6       584  1    1201  2     859   13
C&W       200      560  5    794  2     921  16      630  2    1322  2     942   12
JSMA      1        0    0    15   0     16   6       584  8    1201  51    859   31
JSMA      200      4    1    52   0     42   8       654  8    1410  56    999   35
FGSM      0.04     190  5    342  16    235  20      821  3    1801  106   1560  66
FGSM      0.08     239  6    589  27    349  34      924  9    1842  71    1872  77
FGSM      0.1      349  3    844  36    504  46      950  12   1829  79    2053  105

A.I.(%), DeepFool: MNIST: KD 0.02, DKD 0.12, RI 0.35; CIFAR10: KD 1.36, DKD 2.88, RI 7.16
A.I.(%), C&W:      MNIST: KD 0.01, DKD 0.33, RI 0.22; CIFAR10: KD 13.97, DKD 1.47, RI 2.79
A.I.(%), JSMA:     MNIST: KD 0.01, DKD 0.26, RI 0.15; CIFAR10: KD 1.39, DKD 2.94, RI 4.18
A.I.(%), FGSM:     MNIST: KD 0.03, DKD 0.21, RI 0.12; CIFAR10: KD 1.33, DKD 3.74, RI 2.94

Table 2. The number of failed majorities for an ensemble of three models in the original and boosted versions (indicated with *) of KD, DKD, and RI on the MNIST and CIFAR10 datasets. The accuracy improvement from using the boosted version is shown as Accuracy Improved (A.I.). We investigated our proposed method against well-known attacks: DeepFool (Moosavi-Dezfooli et al., 2017), C&W (Carlini and Wagner, 2017), JSMA (Papernot et al., 2016b), and FGSM (Goodfellow et al., 2014).

To investigate the effectiveness of LSS as a metric for measuring the diversity between latent spaces, we considered three scenarios for training an ensemble of three models with the same structure: I) Random Initialization (RI), where the three models are trained independently from random initial values, see Fig. 3-right; II) Knowledge Distillation (KD), where the three models are trained collaboratively as shown in Fig. 3-middle; and III) Diversity Knowledge Distillation (DKD), where the three models are trained collaboratively but in a different manner than KD, see Fig. 3-left. Note that the KD and RI methods deal with the softmax probabilities, shown with red boxes, whereas DKD uses a mix of the softmax outputs and the latent spaces, shown with black straps. Alongside each of the designs DKD, KD, and RI, a boosted version is implemented, denoted DKD*, KD*, and RI*, respectively. Both KD and DKD are trained one by one, meaning each model considers the previously trained models in its training phase while those models are frozen, i.e., their parameters (weights) are not updated while the new model is being trained.

Figure 5. (A) and (B) show the LSS of an ensemble of three models on the MNIST and CIFAR10 datasets. (C) and (D) show the classification accuracy of the ensemble model on the MNIST and CIFAR10 datasets. Three methods are shown: Random Initialization (RI), Knowledge Distillation (KD), and Diversity Knowledge Distillation (DKD).

Fig. 5-top shows the LSS for an ensemble of three models (for the MNIST and CIFAR10 datasets) with the structures described in Table 1. Fig. 5-A and C show that for the MNIST dataset, increasing the value of the regularization parameter causes 1) a rapid increase in the LSS and 2) a slight drop in classification accuracy. Considering Eq. 4, increasing the regularization parameter puts relatively less emphasis on the cross-entropy term, which explains the slight drop in accuracy and the increase in the LSS of the latent spaces of the ensemble models. A similar pattern occurs for the CIFAR10 dataset, Fig. 5-B and D, in which the LSS increases to its maximum while the classification accuracy slightly increases. Among all values that yield acceptable accuracy, the value leading to the highest LSS is selected as the regularization parameter. In other words, to have diverse latent spaces, the LSS between them should be maximized while the accuracy is kept in an acceptable range. For example, in Fig. 5-B, when the parameter is 0.5, the LSS is at its maximum level while the accuracy is also at an acceptable level. Accordingly, the following attacks were performed with the regularization parameter set to 0.5 and 0.9 for the CIFAR10 and MNIST datasets, respectively.

One immediate observation in Fig. 5 is that the LSS between the latent spaces obtained by the DKD approach is noticeably larger than that of RI, and the LSS between the latent spaces obtained by RI is slightly larger than that of KD. This observation is aligned with our expectations: DKD is designed to increase the diversity between the latent spaces, while KD in essence increases the similarity between models, because the student model(s) imitate the behavior of the teacher(s).

Attacks: DeepFool (Moosavi-Dezfooli et al., 2017), C&W (Carlini and Wagner, 2017), JSMA (Papernot et al., 2016b), FGSM (Goodfellow et al., 2014).

MNIST
Attack     Param.  KD      KD*     DKD     DKD*    RI      RI*     Ref.
DeepFool   1       0.9510  0.9542  0.9721  0.9726  0.962   0.9682  0.9754
DeepFool   200     0.864   0.8642  0.8904  0.8916  0.8433  0.8768  0.5835
C&W        1       0.9612  0.9614  0.988   0.9882  0.9721  0.9726  0.9841
C&W        200     0.7829  0.783   0.8587  0.862   0.8127  0.8149  0.2
JSMA       1       0.9612  0.9614  0.9882  0.9884  0.9721  0.9726  0.9854
JSMA       200     0.8147  0.8148  0.9086  0.9112  0.8403  0.8418  0.4322
FGSM       0.04    0.9553  0.9555  0.9842  0.9846  0.9638  0.9644  0.952
FGSM       0.08    0.9247  0.9249  0.9619  0.9629  0.9333  0.9341  0.864
FGSM       0.1     0.8937  0.894   0.9374  0.9395  0.9059  0.9071  0.7847
No Attack  -       0.9612  0.9614  0.9882  0.9884  0.9721  0.9726  0.9854

CIFAR10
Attack     Param.  KD      KD*     DKD     DKD*    RI      RI*     Ref.
DeepFool   1       0.8355  0.8475  0.92    0.9463  0.8596  0.92    0.6505
DeepFool   200     0.7972  0.8108  0.8981  0.9269  0.8265  0.8981  0.1586
C&W        1       0.8475  0.959   0.9534  0.9575  0.9326  0.9571  0.9516
C&W        200     0.8014  0.9311  0.9245  0.9321  0.9022  0.9301  0.1543
JSMA       1       0.8475  0.8599  0.9326  0.9571  0.8348  0.8703  0.9516
JSMA       200     0.7804  0.7943  0.8851  0.9145  0.7644  0.8062  0.1564
FGSM       0.04    0.8412  0.8548  0.899   0.933   0.8296  0.8676  0.4822
FGSM       0.08    0.7476  0.7615  0.7393  0.7763  0.7267  0.7627  0.2534
FGSM       0.1     0.6998  0.7131  0.6839  0.7213  0.6817  0.7111  0.0867
No Attack  -       0.9501  0.9512  0.9561  0.9580  0.9385  0.9523  0.9607

Table 3. Black-box adversarial attacks on an ensemble of three models on the MNIST and CIFAR10 datasets. In each row, the bold number shows the most resistant defense mechanism.

MNIST
Defense                               Clean   FGSM 0.04  FGSM 0.08  JSMA    C&W     DeepFool
No Defense                            0.9661  0.9171     0.8249     0.2421  0.4409  0.0
DKD*                                  0.9884  0.9835     0.9681     0.8849  0.9594  0.8991
DKD*                                  0.9884  0.9756     0.9329     0.9725  0.9808  0.9605
MagNet (Meng and Chen, 2017)          0.9031  0.6519     0.6141     0.8014  0.4821  0.6518
PixelDefend (Song et al., 2017)       0.8266  0.7335     0.6711     0.9191  0.7560  0.7394
Defense-GAN (Samangouei et al., 2018) 0.9640  0.9450     0.8221     0.9465  0.8012  0.7921

CIFAR10
Defense                               Clean   FGSM 0.04  FGSM 0.08  JSMA    C&W     DeepFool
No Defense                            0.9304  0.2033     0.1846     0.2525  0.2548  0.1441
DKD*                                  0.9604  0.7377     0.7188     0.7788  0.8878  0.7934
DKD*                                  0.9604  0.8923     0.8623     0.9221  0.9538  0.8751
MagNet                                0.9421  0.7881     0.7041     0.8134  0.7743  0.7679
PixelDefend                           0.9491  0.7094     0.6911     0.8411  0.8101  0.7816
Defense-GAN                           0.9219  0.7551     0.7101     0.8712  0.8020  0.7651

Table 4. White-box attacks on the ensemble of three models on the MNIST and CIFAR10 datasets. The two DKD* rows correspond to the two white-box attack scenarios (standalone and aggregated) described in Section 4.3. The bold number in each column shows the most resistant method against white-box attacks.

4.2. Black-Box Attacks

To launch a black-box attack, the adversary takes a reference model and trains it on the available dataset. Then, relying on the transferability of adversarial examples, the adversary crafts adversarial examples on the reference model and applies them to the models under attack. We assume the adversary uses LeNet and VGG16 as the reference models for MNIST and CIFAR10, respectively, and that the models under attack are an ensemble of three models with the structure shown in Table 1. To investigate the performance of the proposed method (DKD) under black-box attacks, we also considered the two other methods, RI and KD, for training an ensemble of three models.

We use majority voting between the ensemble models; however, in some cases each model produces a prediction different from the others', and we refer to these cases as failed majorities. For each of the attacks and both datasets (MNIST and CIFAR10), we counted the number of failed majorities. Table 2 shows the difference between the regular and boosted versions of each benchmark with respect to the number of failed majority votes. From this table, we observe that 1) the number of majority-voting failures for DKD is always higher than for the other two regular methods; this confirms that our objective function successfully trains diverse models, because majority voting fails whenever the models cannot agree on a label, so in the presence of an adversarial example the disagreement between models trained with DKD is higher than for the others; and 2) the number of majority failures drops when going from a regular to a boosted model. The effect of this drop on accuracy is shown by the accuracy-improvement percentage, A.I.

Table 3 shows the results of applying several state-of-the-art attacks to DKD, KD, and RI on the MNIST and CIFAR10 datasets. Investigating Tables 2 and 3 reveals two trends: 1) the boosted version of each method performs better than its regular version, and 2) in almost all attack scenarios, DKD* outperforms the other defense scenarios.

Figure 6. Decision boundaries of two hypothetical models. Planes A and B are representations of the input spaces of the first and second models, respectively. Standalone attacks represent scenarios in which the adversary attacks a single model at a time and applies the resulting adversarial perturbation to the other model, whereas in aggregated attacks the adversary attacks both models simultaneously.

4.3. White-Box Attacks

We considered two different white-box scenarios: 1) the standalone attack, in which the adversary can only apply an adversarial perturbation crafted on one model at a time, and 2) the aggregated attack, in which the adversary applies an adversarial perturbation crafted on both models simultaneously. These two scenarios are explained through a toy example on two models A and B in Fig. 6. For the standalone attack, the adversary first finds an adversarial perturbation direction on model B. In the second step, the adversary applies the same perturbation to the inputs of model A, hoping that the adversarial perturbation transfers to model A. Depending on the angle between the latent spaces of models A and B, three scenarios are possible (Fig. 6), each representing a different hypothetical angle between planes A and B. When the two latent spaces are orthogonal (case I), a successful adversarial perturbation on one model does not transfer to the other model (it causes no displacement there), and vice versa. Hence, a disagreement between the two models can be an indicator of an adversarial example. In the two other cases (II, III), the smaller the angle between the latent spaces, the more probable the transferability of the adversarial perturbation across models. Note that the projection of the perturbation vector onto plane A shows the direction of the adversarial perturbation for model A. If this projection is large enough to move a data point out of its class boundary, then the input is misclassified.
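The geometric intuition above can be checked numerically. In this illustrative sketch (`transfer_strength` is our own hypothetical helper, not part of the paper), B's adversarial direction is projected onto A's sensitive direction; an orthogonal pair yields zero transfer, while an aligned pair transfers in full:

```python
import numpy as np

def transfer_strength(delta, grad_a):
    """Magnitude of B's adversarial direction `delta` projected onto
    model A's (unit-normalized) loss-gradient direction `grad_a`.

    Zero projection corresponds to case I (orthogonal latent spaces):
    the perturbation does not move A's features at all. The larger the
    projection, the more transferable the attack is expected to be."""
    g = grad_a / np.linalg.norm(grad_a)
    return abs(np.dot(delta, g))
```

This is the quantity the red arrows in Fig. 6 depict: the component of the crafted perturbation that survives projection onto the other model's plane.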

For the aggregated attack, the adversary obtains the adversarial perturbations for latent spaces A and B separately and independently, and then selects the direction of their sum as the adversarial direction; by scaling the magnitude along that direction, the adversary obtains an adversarial perturbation that can be applied to both models. In this scenario, let the two perturbation vectors on planes A and B be the ones that lead to an imperceptible adversarial example on each model individually. Based on the angle between these two vectors, three different outcomes are possible: 1) Fig. 6 Aggregated-II: the projection of the sum onto each individual vector is less than that vector itself (this happens when the angle between them exceeds 90 degrees), in which case the sum cannot be a successful attack on either model; 2) Fig. 6 Aggregated-I: the projection of the sum onto each individual vector is greater than that vector itself (angle below 90 degrees), which means the sum can be a successful attack on both models, although the resulting adversarial perturbation is large; 3) Fig. 6 Aggregated-III: the projection of the sum onto each vector equals that vector itself (angle of exactly 90 degrees), which means the sum successfully moves the data point out of its class boundary on both planes.

Table 4 captures the results of various adversarial attacks on our proposed solution. For the FGSM attack, the hyper-parameters 0.1 and 0.3 are reported. Iterative attacks are executed for 200 iterations. The two DKD* rows show the evaluation of DKD* under the standalone attack and under the aggregated attack. As indicated in this table, our proposed solutions outperform prior-art defenses, illustrating the effectiveness of learning diverse features using our proposed solution.

5. Conclusion

To build robust models that resist adversarial attacks, we proposed a method for increasing the diversity between the latent-space representations of the models used within an ensemble. We also introduced Latent Space Separation (the distance between the latent-space representations of the models in the ensemble) as a metric for measuring the robustness of the ensemble to adversarial examples. The evaluation of our proposed solution against white-box and black-box attacks indicates that the proposed ensemble model is resistant to adversarial examples and outperforms prior-art solutions.

6. Acknowledgment

This research was supported by the National Science Foundation (NSF Award# 1718538).