EG-Booster: Explanation-Guided Booster of ML Evasion Attacks

08/31/2021 ∙ by Abderrahmen Amich, et al. ∙ University of Michigan

The widespread usage of machine learning (ML) in a myriad of domains has raised questions about its trustworthiness in security-critical environments. Part of the quest for trustworthy ML is robustness evaluation of ML models against test-time adversarial examples. In line with the trustworthy ML goal, a useful input to potentially aid robustness evaluation is feature-based explanations of model predictions. In this paper, we present a novel approach called EG-Booster that leverages techniques from explainable ML to guide adversarial example crafting for improved robustness evaluation of ML models before deploying them in security-critical settings. The key insight in EG-Booster is the use of feature-based explanations of model predictions to guide adversarial example crafting by adding consequential perturbations likely to result in model evasion and avoiding non-consequential ones unlikely to contribute to evasion. EG-Booster is agnostic to model architecture and threat model, and supports diverse distance metrics used previously in the literature. We evaluate EG-Booster using the image classification benchmark datasets MNIST and CIFAR10. Our findings suggest that EG-Booster significantly improves the evasion rate of state-of-the-art attacks while performing fewer perturbations. Through extensive experiments that cover four white-box and three black-box attacks, we demonstrate the effectiveness of EG-Booster against two undefended neural networks trained on MNIST and CIFAR10, and an adversarially-trained ResNet model trained on CIFAR10. Furthermore, we introduce a stability assessment metric and evaluate the reliability of our explanation-based approach by observing the similarity between the model's classification outputs across multiple runs of EG-Booster.




1. Introduction

Machine Learning (ML) models are vulnerable to test-time evasion attacks called adversarial examples: adversarially perturbed inputs aimed to mislead a deployed ML model (FGSM). Evasion attacks have been the subject of recent research (FGSM; BIM; PGSM; MIM; CW; HSJA20; uesato2018adversarial) that led to understanding the potential threats posed by these attacks when ML models are deployed in security-critical settings such as self-driving vehicles (DL-autnonmous17), malware classification (MalConvEvade18), speech recognition (DL-Speech2012), and natural language processing.


Given an ML model $f$ and an input $x$ with a true label $y_{true}$, the goal of a typical evasion attack is to perform minimal perturbations to $x$ and obtain $x'$ similar to $x$ such that $f$ is fooled to misclassify $x'$ as $y' \neq y_{true}$. For a defender whose goal is to conduct pre-deployment robustness assessment of a model against adversarial examples, one needs to adopt a reliable input crafting strategy that can reveal the potential security limitations of the ML system. In this regard, a systematic examination of the suitability of each feature as a potential candidate for adversarial perturbations can be guided by the contribution of each feature to the classification result (amich2021explanationguided). In this context, recent progress in feature-based ML explanation techniques (LIME; SHAP; LEMNA) is well aligned with the robustness evaluation goal. In fact, ML explanation methods have recently been utilized to guide robustness evaluation of models against backdoor poisoning (EG-Poisoning), model extraction (EG-model-extraction19; EG-model-extraction20), and membership inference attacks (EG-mem-inference2021), which highlights the utility of ML explanation methods beyond ensuring the transparency of model predictions.

In this paper, we present EG-Booster, an explanation-guided evasion booster that leverages techniques from explainable ML (LIME; SHAP; LEMNA) to guide adversarial example crafting for improved robustness evaluation of ML models before deploying them in security-critical settings. Inspired by a case study in (amich2021explanationguided), our work is the first to leverage black-box model explanations as a guide for systematic robustness evaluation of ML models against adversarial examples. The key insight in EG-Booster is the use of feature-based explanations of model predictions to guide adversarial example crafting. Given a model $f$, a $d$-dimensional input sample $x$, and a true prediction label $y_{true}$ such that $f(x) = y_{true}$, an ML explanation method returns a weight vector $W_{x,y_{true}} = [w_1, \dots, w_d]$ where each $w_i$ quantifies the contribution of feature $x_i$ to the prediction $y_{true}$. The sign of $w_i$ represents the direction of feature $x_i$ with respect to $y_{true}$. If $w_i > 0$, $x_i$ is directed towards $y_{true}$ (in this case we call $x_i$ a positive feature). In the opposite case, i.e., if $w_i < 0$, $x_i$ is directed away from $y_{true}$ (in this case we call $x_i$ a negative feature). EG-Booster leverages the signs of individual feature weights in two complementary ways. First, it uses the positive weights to identify positive features that are worth perturbing and introduces consequential perturbations likely to result in an $x'$ that will be misclassified as $y' \neq y_{true}$. Second, it uses negative weights to identify negative features that need not be perturbed and eliminates non-consequential perturbations unlikely to contribute to the misclassification goal. EG-Booster is agnostic to model architecture, adversarial knowledge and capabilities (e.g., black-box, white-box), and supports diverse distance metrics (e.g., common $L_p$ norms) used previously in the adversarial examples literature.
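The sign-based split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `split_by_direction` and the example weights are assumptions, and features with a zero weight are placed in neither group.

```python
def split_by_direction(weights):
    """Split feature indices by the sign of their explanation weight.

    Positive features are directed towards the true label and are candidates
    for perturbation; negative features are directed away from it.
    """
    positive = [i for i, w in enumerate(weights) if w > 0]
    negative = [i for i, w in enumerate(weights) if w < 0]
    return positive, negative


# Hypothetical explanation weights for a 4-feature input.
positive, negative = split_by_direction([0.4, -0.1, 0.0, 0.2])
```

Only the sign of each weight is used here, which mirrors the point made later in the paper that feature directions tend to be more stable than weight magnitudes.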

In an orthogonal line of research that studied explanation stability and adversarial robustness of model explanations, some limitations of explanation methods have been documented (explainEval2020; explainEval). These potential limitations call for systematic vetting of the reliability of ML explanation methods before using them for robustness assessment of ML models. We therefore introduce an explanation assessment metric called k-Stability, which measures the average stability of evasion results based on the similarities of the target model’s predictions returned by multiple runs of EG-Booster.

We evaluate EG-Booster through comprehensive experiments on two benchmark datasets (MNIST and CIFAR10), across white-box and black-box attacks, on state-of-the-art undefended and defended models, covering commonly used $L_p$ norms. Among white-box attacks, we use the Fast Gradient Sign Method (FGS) (FGSM), the Basic Iterative Method (BIM) (BIM), the Projected Gradient Descent method (PGD) (PGSM), and the Carlini and Wagner attack (C&W) (CW). Among black-box attacks, we use the Momentum Iterative Method (MIM) (MIM), the HopSkipJump Attack (HSJA) (HSJA20), and the Simultaneous Perturbation Stochastic Approximation (SPSA) attack (uesato2018adversarial).

Across all studied models, we observe a significant increase in the evasion rate of the studied baseline attacks after combining them with EG-Booster. In particular, results show an average increase in the evasion rate across all attacks performed on the undefended MNIST-CNN model. Similar findings are observed on the undefended CIFAR10-CNN model and the defended CIFAR10-ResNet model. Together with the reduction in the total number of perturbations compared to the baseline attacks, these findings indicate that ML explanation methods can be harnessed to guide ML evasion attacks towards more consequential perturbations. Furthermore, the stability analysis results show that EG-Booster’s outputs are stable across different runs. Such findings highlight the reliability of explanation methods for boosting evasion attacks, despite their stability concerns. More detailed results are discussed in Section 5.

In summary, this paper makes the following contributions:

  • Explanation-Guided Evasion Booster. We introduce the first approach that leverages ML explanation methods towards evaluating robustness of ML models to adversarial examples.

  • Stability Analysis. We introduce a novel stability metric that enables vetting the reliability of ML explanation methods before they are used to guide ML robustness evaluation.

  • Comprehensive Evaluation. We conduct comprehensive evaluations on two benchmark datasets, four white-box attacks, three black-box attacks, different distance metrics, on undefended and defended state-of-the-art target models.

To enable reproducibility, we have made available our source code with directions to repeat our experiments. EG-Booster code is available at:

2. Background

In this section, we introduce ML explanation methods and ML evasion attacks.

2.1. ML Explanation Methods

Humans typically justify their decisions by explaining the underlying causes used to reach them. For instance, in an image classification task (e.g., cats vs. dogs), humans attribute their classification decision (e.g., cat) to certain parts/features (e.g., pointy ears, longer tails) of the image they see, and not all features have the same importance/weight in the decision process. ML models were long perceived as black boxes in their predictions until the advent of explainable ML (LIME; DeepLIFT; SHAP), which attributes a decision of a model to the features that contributed to it. This notion of attribution is based on the quantifiable contribution of each feature to a model’s decision. Intuitively, an explanation is defined as follows: given an input vector $x = (x_1, \dots, x_d)$, a model $f$, and a prediction label $f(x) = y$, an explanation method determines why input $x$ has been assigned the label $y$. This explanation is typically represented as a weight vector $W_{x,y} = [w_1, \dots, w_d]$ that captures the contribution of each feature $x_i$ towards $f(x) = y$.

ML explanation is usually accomplished by training a substitute model based on the input feature vectors and output predictions of the model, and then using the coefficients of that model to approximate the importance and direction (the class label it leans to) of a feature. A typical substitute model for explanation is of the form $w_0 + \sum_{i=1}^{d} w_i x_i$, where $d$ is the number of features, $x$ is the sample, $x_i$ is the $i$-th feature of sample $x$, and $w_i$ is the weight of feature $x_i$ in the model’s decision. While ML explanation methods exist for white-box (Whitebox-exp13; Whitebox-exp14) or black-box (LIME; LEMNA; SHAP) access to the model, in this work we consider ML explanation methods that have black-box access to the ML model, among which the notable ones are LIME (LIME), SHAP (SHAP) and LEMNA (LEMNA). Next, we briefly introduce these explanation methods.

LIME and SHAP. Ribeiro et al. (LIME) introduce LIME as one of the first model-agnostic black-box methods for locally explaining model output. Lundberg and Lee further extended LIME by proposing SHAP (SHAP). Both methods approximate the decision function $f$ by creating a series of $l$ perturbations of a sample $x$, denoted as $\tilde{x}_1, \dots, \tilde{x}_l$, by randomly setting feature values in the vector $x$ to $0$. The methods then proceed by predicting a label $f(\tilde{x}_i) = \tilde{y}_i$ for each $\tilde{x}_i$ of the $l$ perturbations. This sampling strategy enables the methods to approximate the local neighborhood of $f$ at the point $f(x)$. LIME approximates the decision boundary by a weighted linear regression model using Equation 1:

$$\arg\min_{g \in G} \sum_{i=1}^{l} \pi_x(\tilde{x}_i) \left( g(\tilde{x}_i) - \tilde{y}_i \right)^2 \qquad (1)$$

In Equation 1, $G$ is the set of all linear functions and $\pi_x$ is a function indicating the difference between the input $x$ and a perturbation $\tilde{x}$. SHAP follows a similar approach but employs the SHAP kernel as weighting function $\pi_x$, which is computed using the Shapley Values (shapley) when solving the regression. Shapley Values are a concept from game theory where the features act as players under the objective of finding a fair contribution of the features to the payout, in this case the prediction of the model.
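To make the weighted regression concrete, the following is a minimal pure-Python sketch of a LIME-style local surrogate. It is an illustration under stated assumptions, not the LIME or SHAP library code: the names `lime_style_weights` and `solve_linear_system` are hypothetical, all $2^d$ zero-masking perturbations are enumerated instead of sampled so the result is deterministic, and an exponential kernel on the squared Euclidean distance stands in for the weighting function $\pi_x$.

```python
import itertools
import math


def solve_linear_system(a, b):
    """Solve a z = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lime_style_weights(predict, x, kernel_width=3.0):
    """Fit a weighted linear surrogate around x; returns [w_0, w_1, ..., w_d]."""
    d = len(x)
    rows, targets, kernel = [], [], []
    for mask in itertools.product([0, 1], repeat=d):
        # Perturbation: set the masked-out features of x to 0.
        x_tilde = [xi * zi for xi, zi in zip(x, mask)]
        dist2 = sum((a - b) ** 2 for a, b in zip(x, x_tilde))
        rows.append([1.0] + x_tilde)           # leading 1.0 models the intercept
        targets.append(predict(x_tilde))       # model output on the perturbation
        kernel.append(math.exp(-dist2 / kernel_width ** 2))
    n = d + 1
    # Normal equations of weighted least squares: (X^T P X) beta = X^T P y.
    gram = [[sum(k * r[i] * r[j] for k, r in zip(kernel, rows)) for j in range(n)]
            for i in range(n)]
    rhs = [sum(k * r[i] * t for k, r, t in zip(kernel, rows, targets))
           for i in range(n)]
    return solve_linear_system(gram, rhs)


# Hypothetical model that happens to be linear, so the surrogate is exact.
def toy_model(v):
    return 0.5 + 2.0 * v[0] - 1.0 * v[1]


beta = lime_style_weights(toy_model, [1.0, 2.0, 3.0])
```

For this toy model the recovered coefficients match the true ones, and their signs give the feature directions: feature 0 is positive, feature 1 is negative, and feature 2 has no influence. The sketch requires every entry of `x` to be non-zero, since a zero-valued feature cannot be distinguished from its masked version.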

LEMNA. Another black-box explanation method specifically designed to be a better fit for non-linear models is LEMNA (LEMNA). As shown in Equation 2, it uses a mixture regression model for approximation, that is, a weighted sum of $K$ linear models.

$$f(x) = \sum_{j=1}^{K} \pi_j \left( \beta_j \cdot x + \epsilon_j \right) \qquad (2)$$

In Equation 2, the parameter $K$ specifies the number of models, the random variables $\epsilon = (\epsilon_1, \dots, \epsilon_K)$ originate from a normal distribution $\epsilon_j \sim N(0, \sigma)$, and $\pi = (\pi_1, \dots, \pi_K)$ holds the weights for each model. The variables $\beta_1, \dots, \beta_K$ are the regression coefficients and can be interpreted as $K$ linear approximations of the decision boundary near $f(x)$.
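As a small illustration of the mixture form, the sketch below evaluates a weighted sum of linear models for given mixture weights and coefficients. It is not LEMNA's fitting procedure: the function name `mixture_prediction` and the example parameters are assumptions, and the zero-mean noise terms are omitted so that the value is deterministic.

```python
def mixture_prediction(x, pis, betas):
    """Noise-free mixture regression output: sum_j pi_j * (beta_j . x)."""
    return sum(
        pi * sum(b * v for b, v in zip(beta, x))
        for pi, beta in zip(pis, betas)
    )


# Two hypothetical linear components with mixture weights 0.25 and 0.75.
value = mixture_prediction([1.0, 2.0], [0.25, 0.75], [[1.0, 0.0], [0.0, 1.0]])
```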

Evaluation of ML explanation methods. Recent studies have focused on evaluating ML explainers for security applications (explainEval; explainEval2020). The authors proposed security-related evaluation criteria such as Accuracy, Stability and Robustness. Results have shown that learning-based approaches such as LIME, SHAP and LEMNA suffer from unstable output. In particular, their explanations can slightly differ between two distinct runs. In this work, we take measures to ensure that EG-Booster does not inherit the potential stability limitation of the employed ML explanation method (more in Section 4.4). Furthermore, recent studies (exp-Rob1; exp-Rob2) have demonstrated that explanation results are sensitive to small systematic feature perturbations that preserve the predicted label. Such attacks can potentially alter the explanation results, which raises concerns about the robustness of ML explanation methods. In EG-Booster, baseline attacks and explanation methods are locally integrated into the robustness evaluation workflow (hence not disclosed to an adversary).

2.2. ML Evasion Attacks

Given an ML model (e.g., image classifier, malware classifier) with a decision function $f : X \rightarrow Y$ that maps an input sample $x \in X$ to a true class label $y_{true} \in Y$, $x' = x + \delta$ is called an adversarial sample with an adversarial perturbation $\delta$ if: $f(x') = y' \neq y_{true}$ and $||\delta|| < \epsilon$, where $||\cdot||$ is a distance metric (e.g., one of the $L_p$ norms) and $\epsilon$ is the maximum allowable perturbation that results in misclassification while preserving the semantic integrity of $x$. Semantic integrity is domain and/or task specific. For instance, in image classification, visual imperceptibility of $x'$ from $x$ is desired, while in malware detection $x$ and $x'$ need to satisfy certain functional equivalence (e.g., if $x$ was malware pre-perturbation, $x'$ is expected to exhibit maliciousness post-perturbation as well). In untargeted evasion, the goal is to make the model misclassify a sample to any different class (e.g., for a roadside sign detection model: misclassify a red light as any other sign). When the evasion is targeted, the goal is to make the model misclassify a sample to a specific target class (e.g., in malware detection: misclassify malware as benign).
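The two conditions in this definition (misclassification and a bounded perturbation) can be checked directly. The sketch below is illustrative only: `lp_norm` and `is_adversarial` are hypothetical helper names, the model is any callable returning a label, and semantic integrity is not modeled beyond the norm bound.

```python
import math


def lp_norm(delta, p):
    """L_p norm of a perturbation vector; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(v) for v in delta)
    return sum(abs(v) ** p for v in delta) ** (1.0 / p)


def is_adversarial(predict, x, x_adv, y_true, eps, p=math.inf):
    """True if x_adv is misclassified and the perturbation size is below eps."""
    delta = [a - b for a, b in zip(x_adv, x)]
    return predict(x_adv) != y_true and lp_norm(delta, p) < eps


# Hypothetical threshold classifier on the first feature.
def toy_classifier(v):
    return int(v[0] > 0.5)


evades = is_adversarial(toy_classifier, [0.6, 0.2], [0.45, 0.2], y_true=1, eps=0.2)
```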

Evasion attacks can be done in a white-box or black-box setting. Most gradient-based evasion techniques (FGSM; BIM; PGSM; CW) are white-box because the adversary typically has access to the model architecture and parameters/weights, which allows querying the model directly to decide how to increase the model’s loss function. Gradient-based strategies assume that the adversary has access to the gradient function of the model (i.e., white-box access). The core idea is to find the perturbation vector $\delta^{\star} \in \mathbb{R}^{d}$ that maximizes the loss function $J(\theta, x, y_{true})$ of the model $f$, where $\theta$ are the parameters (i.e., weights) of the model $f$. In recent years, several white-box adversarial sample crafting methods have been proposed, especially for image classification tasks. Some of the most notable ones are: Fast Gradient Sign Method (FGS) (FGSM), Basic Iterative Method (BIM) (BIM), Projected Gradient Descent (PGD) method (PGSM), and Carlini & Wagner (C&W) method (CW). Black-box evasion techniques (e.g., MIM (MIM), HSJA (HSJA20), SPSA (uesato2018adversarial)) usually start from some initial perturbation $\delta_0$, and subsequently probe $f$ on a series of perturbations $f(x + \delta_i)$, to craft $x'$ such that $f$ misclassifies it to a label different from its original.

In Section 3, we briefly introduce the seven reference evasion attacks we used as baselines for EG-Booster.

3. Studied Baseline Attacks

In this section, we succinctly highlight the four white-box and three black-box attacks we use as baselines for EG-Booster.

3.1. White-Box Attacks

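Fast-Gradient Sign Method (FGS) (FGSM) is a fast one-step method that crafts an adversarial example. Considering the dot product of the weight vector $\theta$ and an adversarial example (i.e., $x' = x + \delta$), $\theta^{T} x' = \theta^{T} x + \theta^{T} \delta$, the adversarial perturbation causes the activation to grow by $\theta^{T} \delta$. Goodfellow et al. (FGSM) suggested maximizing this increase subject to the maximum perturbation constraint $||\delta|| < \epsilon$ by assigning $\delta = \epsilon \cdot \mathrm{sign}(\theta)$. Given a sample $x$, the optimal perturbation is given as follows:

$$\delta^{\star} = \epsilon \cdot \mathrm{sign}\left( \nabla_{x} J(\theta, x, y_{true}) \right)$$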
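The one-step update can be sketched on a model whose input gradient is known in closed form. The example below assumes a binary logistic-regression classifier with cross-entropy loss, for which the gradient with respect to the input is $(p - y)\,\theta$; the function names and the numeric values are hypothetical, and no clipping to a valid input range is applied.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)) + bias)


def input_gradient(theta, bias, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input: (p - y) * theta."""
    p = predict_proba(theta, bias, x)
    return [(p - y) * t for t in theta]


def sign(v):
    return (v > 0) - (v < 0)


def fgs(theta, bias, x, y, eps):
    """One-step FGS: move every feature by eps in the direction that raises the loss."""
    grad = input_gradient(theta, bias, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


theta, bias = [2.0, -1.0], 0.0
x, y_true = [0.6, 0.4], 1
x_adv = fgs(theta, bias, x, y_true, eps=0.3)
```

With these values the clean input is classified as class 1 (probability above 0.5), while the perturbed input falls below 0.5, so the single step is enough to evade this toy model.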


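Basic Iterative Method (BIM) (BIM) was introduced as an improvement of FGS. In BIM, the authors suggest applying the same step as FGS multiple times with a small step size $\alpha$ and clipping the pixel values of intermediate results after each step to ensure that they are in an $\epsilon$-neighbourhood of the original image. Formally, the generated adversarial sample after $n+1$ iterations is given as follows:

$$x'_{0} = x, \qquad x'_{n+1} = \mathrm{Clip}_{x,\epsilon}\left\{ x'_{n} + \alpha \cdot \mathrm{sign}\left( \nabla_{x} J(\theta, x'_{n}, y_{true}) \right) \right\}$$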
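A minimal sketch of the iterative update follows. It assumes the loss gradient is available through a caller-supplied function `grad_fn` (a hypothetical name), that features live in $[0, 1]$, and that the $\epsilon$-neighbourhood is measured with the max norm; none of these choices is prescribed by the text above.

```python
def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def bim(grad_fn, x, eps, alpha, steps):
    """Repeat a small signed-gradient step, clipping back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        x_adv = [
            # First stay within eps of the original value, then within [0, 1].
            clip(clip(xa + alpha * sign(g), xi - eps, xi + eps), 0.0, 1.0)
            for xa, g, xi in zip(x_adv, grad, x)
        ]
    return x_adv


# Hypothetical constant gradient: the loss grows with feature 0 and shrinks with feature 1.
x_adv = bim(lambda v: [1.0, -1.0], [0.5, 0.5], eps=0.1, alpha=0.04, steps=5)
```

After the third step the cumulative change would exceed $\epsilon$, so clipping holds both features at distance 0.1 from their original values.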


Projected Gradient Descent (PGD) (PGSM) is essentially the same as the BIM attack. The only difference is that PGD initializes the example to a random point in the ball of interest (i.e., the allowable perturbations determined by the norm; a ball is the volume bounded by a sphere, also called a solid sphere) and does random restarts, while BIM initializes to the original point $x$.

Carlini-Wagner (C&W) (CW) is one of the most powerful attacks, where the adversarial example generation problem is formulated as the following optimization problem:

$$\text{minimize } D(x, x + \delta) \quad \text{such that } f(x + \delta) = t, \quad x + \delta \in [0, 1]^{n}$$

The goal is to find a small change $\delta$ such that when added to an image $x$, the image is misclassified (to a targeted class $t$) by the model but the image is still a valid image. $D$ is some distance metric (e.g., $L_0$, $L_2$ or $L_\infty$). Due to the non-linear nature of the classification function $f$, the authors defined a simpler objective function $g$ such that $f(x + \delta) = t$ if and only if $g(x + \delta) \leq 0$. Multiple options for the explicit definition of $g$ are discussed in the paper (CW). Considering the $L_p$ norm as the distance $D$, the optimization problem is simplified as follows:

$$\text{minimize } ||\delta||_{p} + c \cdot g(x + \delta) \quad \text{such that } x + \delta \in [0, 1]^{n}$$

where $c > 0$ is a suitably chosen constant.

3.2. Black-Box Attacks

Momentum Iterative Method (MIM) (MIM) is proposed as a technique for addressing a likely limitation of BIM: the greedy move of the adversarial example in the direction of the gradient in each iteration can easily drop it into poor local maxima. It does so by leveraging the momentum method (momentum), which stabilizes gradient updates by accumulating a velocity vector in the gradient direction of the loss function across iterations, and allows the algorithm to escape from poor local maxima. For a decay factor $\mu$, the gradient $g_{t+1}$ for iteration $t+1$ is updated as follows:

$$g_{t+1} = \mu \cdot g_{t} + \frac{\nabla_{x} J(\theta, x'_{t}, y_{true})}{||\nabla_{x} J(\theta, x'_{t}, y_{true})||_{1}}$$

where the adversarial example is obtained as $x'_{t+1} = x'_{t} + \alpha \cdot \mathrm{sign}(g_{t+1})$ with step size $\alpha$.
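The effect of the accumulated velocity can be seen in a short sketch. As with the other illustrations, `grad_fn` is a hypothetical caller-supplied gradient function, the gradient is normalized by its $L_1$ norm before accumulation, and only the $\epsilon$-ball clipping (max norm) is applied.

```python
def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def mim(grad_fn, x, eps, alpha, steps, mu=1.0):
    """Iterative signed step that follows an accumulated (momentum) gradient."""
    x_adv = list(x)
    velocity = [0.0] * len(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        l1 = sum(abs(v) for v in grad) or 1.0  # avoid division by zero
        velocity = [mu * gi + v / l1 for gi, v in zip(velocity, grad)]
        x_adv = [
            clip(xa + alpha * sign(gi), xi - eps, xi + eps)
            for xa, gi, xi in zip(x_adv, velocity, x)
        ]
    return x_adv


# Hypothetical gradients whose first component flips sign on the second step.
grads = iter([[3.0, 1.0], [-1.0, 3.0]])
x_adv = mim(lambda v: next(grads), [0.0, 0.0], eps=1.0, alpha=0.1, steps=2)
```

In this example the raw gradient of feature 0 turns negative at the second step, but the accumulated velocity stays positive, so the feature keeps moving in the same direction. A momentum-free update would have stepped back.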

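HopSkipJump Attack (HSJA) (HSJA20) is an attack that relies on the prediction label only to create an adversarial example. The key intuition of HSJA is based on gradient-direction estimation, motivated by zeroth-order optimization. Given an input $x$ with true label $y_{true}$, the attack is built around a function $S_x$ which, for untargeted evasion, is defined over the model's per-class output scores $F_c$ as follows:

$$S_{x}(x') = \max_{c \neq y_{true}} F_{c}(x') - F_{y_{true}}(x')$$

A perturbed input $x'$ is a successful attack if and only if $S_x(x') > 0$. The boundary between successful and unsuccessful perturbed inputs is given by $bd(S_x) = \{ z \in [0,1]^{d} \mid S_x(z) = 0 \}$. As an indicator of successful perturbation, the authors introduce the Boolean-valued function $\phi_x : [0,1]^{d} \rightarrow \{-1, 1\}$ via:

$$\phi_{x}(x') = \mathrm{sign}\left( S_{x}(x') \right) = \begin{cases} 1 & \text{if } S_{x}(x') > 0 \\ -1 & \text{otherwise} \end{cases}$$

The goal of an adversarial attack is to generate a perturbed sample $x'$ such that $\phi_x(x') = 1$, while keeping $x'$ close to the original input sample $x$. This can be formulated as an optimization problem: $\min_{x'} D(x', x)$ such that $\phi_x(x') = 1$, where $D$ is a distance metric that quantifies similarity.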
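The success indicator can be illustrated with a few lines of Python. The sketch assumes the untargeted form used in the HopSkipJump paper, where the margin is the best wrong-class score minus the true-class score; the function names and the example scores are hypothetical, and an actual decision-based attacker would only observe the resulting sign through the predicted label.

```python
def margin_untargeted(scores, y_true):
    """Best score among wrong classes minus the score of the true class."""
    best_other = max(s for c, s in enumerate(scores) if c != y_true)
    return best_other - scores[y_true]


def success_indicator(scores, y_true):
    """Returns 1 if the perturbed input is misclassified, -1 otherwise."""
    return 1 if margin_untargeted(scores, y_true) > 0 else -1


# Hypothetical per-class scores of a 3-class model for two perturbed inputs.
still_correct = success_indicator([0.7, 0.2, 0.1], y_true=0)
evaded = success_indicator([0.3, 0.6, 0.1], y_true=0)
```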

Simultaneous Perturbation Stochastic Approximation (SPSA) by Uesato et al. (uesato2018adversarial) is an attack that re-purposes gradient-free optimization techniques into adversarial example attacks. The authors explore adversarial risk as a measure of the model’s performance on worst-case inputs. Since the exact adversarial risk is computationally intractable to evaluate, they instead frame commonly used attacks (such as the ones we described earlier) and adversarial evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. The details of the algorithm are given in the original paper.


4. Explanation-Guided Booster

In this section, we describe the details of EG-Booster.

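Result: x'_boost, N_nc, N_c
1 Input:
2 f: black-box classification function;
3 x: legitimate input sample;
4 x'_baseline: x’s adversarial variant returned by baseline attack;
5 W_{x,y_true} = {w_1, ..., w_d}: feature weights (explanations) of x toward y_true;
6 ||.||_p: norm used in the baseline attack;
7 ε: perturbation bound used in the baseline attack;
8 max_iter: maximum iterations to make bounded perturbation δ_i;
9 Initialization:
10 W_{x,y_true} ← ML explanation of x toward y_true;
11 x'_boost ← x'_baseline;
12 y_baseline ← f(x'_baseline);
13 d ← dim(x);
14 N_nc ← 0, N_c ← 0;
15 Output: x'_boost, N_nc, N_c
// Eliminate non-consequential perturbations
16 foreach w_i ∈ W_{x,y_true} do
17       if w_i < 0 then
18             x'_boost[i] ← x[i];
19             if y_baseline ≠ y_true and f(x'_boost) = y_true then
20                   x'_boost[i] ← x'_baseline[i];
21            else
22                   N_nc ← N_nc + 1;
23             end if
25       end if
27 end foreach
// Add consequential perturbations if baseline attack fails
28 foreach w_i ∈ W_{x,y_true} do
29       if w_i > 0 and f(x'_boost) = y_true then
30             δ_i ← get_delta(x); x'_boost[i] ← x'_boost[i] + δ_i;
31            iter ← 0; while ||x'_boost − x||_p > ε do
32                   reduce δ_i and update x'_boost[i]; iter ← iter + 1; if iter = max_iter then
33                         break;
34                   end if
36             end while
37            if ||x'_boost − x||_p > ε then
38                   x'_boost[i] ← x'_baseline[i];
39            else
40                   N_c ← N_c + 1;
41             end if
43       end if
45 end foreach
Algorithm 1 Explanation-Guided Booster Attack.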

4.1. Overview

In the reference attacks we introduced in Section 3, without loss of generality, the problem of crafting an adversarial example can be stated as follows: Given an input $x$ such that $f(x) = y_{true}$, the goal of the attack is to find the optimal perturbation $\delta^{\star}$ such that the adversarial example $x' = x + \delta^{\star}$ is misclassified as $f(x') = y' \neq y_{true}$. This problem is formulated as:

$$\delta^{\star} = \arg\max_{\delta} J(\theta, x + \delta, y_{true}) \quad \text{s.t. } ||\delta||_{p} < \epsilon \qquad (8)$$

The only constraint of Equation 8 is that the perturbation size of the vector $\delta$ is bounded by a maximum allowable perturbation size $\epsilon$ (i.e., $||\delta||_p < \epsilon$), which ensures that adversarial manipulations preserve the semantics of the original input. However, it does not guarantee that all single feature perturbations (i.e., $x'_i = x_i + \delta_i$) are consequential to result in evasion. Our Explanation-Guided Booster (EG-Booster) approach improves Equation 8 to guide any $L_p$ norm attack to perform only necessary perturbations that can cause evasion. In addition to the upper bound constraint $\epsilon$, EG-Booster satisfies an additional constraint that guarantees the perturbation of only the features that are initially contributing to a correct prediction (positive features). In other words, EG-Booster serves to guide the state-of-the-art attacks towards perturbing only features with positive explanation weights $w_i > 0$, where $W_{x,y_{true}} = \{w_1, \dots, w_d\}$ are the feature weights (explanations) of the input $x$ towards the true label $y_{true}$ (as explained in Section 2.1). Formally, the EG-Booster adversarial example crafting problem is stated as follows:

$$\delta^{\star} = \arg\max_{\delta} J(\theta, x + \delta, y_{true}) \quad \text{s.t. } ||\delta||_{p} < \epsilon, \quad \delta_{i} \neq 0 \Rightarrow w_{i} > 0$$


The second constraint ensures that only features with positive explanation weights are selected for perturbation.

As shown in Algorithm 1 (line 10), first, EG-Booster performs ML explanation on the input $x$, in order to gather the original direction of each feature $x_i$ with respect to the true label $y_{true}$. Next, using the explanation results ($W_{x,y_{true}}$), EG-Booster reviews the initial perturbations performed by a baseline attack by eliminating perturbations of features that have negative explanation weight (lines 16–25) and adding more perturbations on features that have positive explanation weight (lines 26–44). We ensure that EG-Booster’s intervention is pre-conditioned on maintaining at least the same evasion success of an evasion attack, if not improving it. In particular, in case the initial adversarial sample $x'_{baseline}$ generated by a baseline attack succeeds in evading the model, EG-Booster eliminates only perturbations that do not affect the initial evasion result. In this case, even if there exist unperturbed positive features when $x'_{baseline}$ was crafted, EG-Booster does not perform any additional perturbations since the evasion result has already been achieved.

Next, we use Algorithm 1 to further explain the details of EG-Booster.

4.2. Eliminating Non-Consequential Perturbations

Figure 1. An example of the impact of eliminating non-consequential perturbations from a perturbed MNIST input that already succeeds to evade a CNN model using FGS. We put the new prediction of each version of the input image in parentheses (.).

As shown in Algorithm 1 (lines 16–25), EG-Booster starts by eliminating unnecessary perturbations that are expected to be non-consequential to the evasion result (if any). If the adversarial input produced by the baseline attack already leads to a successful evasion, EG-Booster ensures that the perturbation elimination step does not affect the initial evasion success (lines 19–20). Accordingly, EG-Booster intervenes to eliminate the perturbations that have no effect on the final evasive prediction. More precisely, it searches for perturbed features that are originally not directed to the true label $y_{true}$ (line 17), and then it restores the original values of those features (i.e., eliminates the perturbation) (line 18) while ensuring the validity of the initial evasion if it exists (it does so by skipping any elimination of feature perturbations that does not preserve the model evasion). As we will show in our experiments (Section 5.3), this step can lead to a significant drop in the total number of perturbed features, which results in a smaller perturbation size $||\delta||_p$.

Figure 1 shows an MNIST-CNN example of the impact of enhancing the FGS (FGSM) attack with EG-Booster when the baseline perturbations are already evasive. In this example, FGS perturbations fool the CNN model to incorrectly predict digit ‘9’ as digit ‘8’. The red circles on the left hand side image show the detection of non-consequential perturbations signaled by EG-Booster using the pre-perturbation model explanation results. The new version of the input image shown on the right hand side of the figure illustrates the impact of eliminating those non-consequential feature perturbations. The number of feature perturbations is reduced by 7% while preserving the evasion success.

The other alternative is when the baseline attack initially fails to evade the model. In such a case, it is crucial to perform this perturbation elimination step over all non-consequential features before making any additional perturbations. Doing so reduces the perturbation size $||\delta||_p$, which provides a larger gap ($\epsilon - ||\delta||_p$) for performing additional consequential perturbations that can cause a misclassification.
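The elimination step described in this subsection can be sketched as follows. This is an illustrative reading of the prose, not the authors' code: the function name, the argument order, and the rule for undoing a restoration (undo only when the baseline sample evaded and the restoration brings back the true label) are assumptions.

```python
def eliminate_non_consequential(predict, x, x_baseline, weights, y_true):
    """Restore perturbed negative features unless that breaks an existing evasion.

    Returns the revised adversarial sample and the number of removed perturbations.
    """
    x_boost = list(x_baseline)
    baseline_evades = predict(x_baseline) != y_true
    removed = 0
    for i, w in enumerate(weights):
        if w >= 0 or x_boost[i] == x[i]:
            continue  # only perturbed features with a negative weight
        perturbed_value = x_boost[i]
        x_boost[i] = x[i]  # eliminate the perturbation
        if baseline_evades and predict(x_boost) == y_true:
            x_boost[i] = perturbed_value  # restoring broke the evasion: undo
        else:
            removed += 1
    return x_boost, removed


# Hypothetical classifier that only looks at feature 0.
def toy_classifier(v):
    return int(v[0] > 0.5)


x = [0.9, 0.5, 0.5]
x_baseline = [0.4, 0.7, 0.5]          # evades: predicted 0 instead of 1
weights = [0.8, -0.2, 0.0]            # feature 1 is a negative feature
x_boost, removed = eliminate_non_consequential(
    toy_classifier, x, x_baseline, weights, y_true=1
)
```

Here the perturbation on feature 1 is removed because it does not contribute to the evasion, while the perturbation on the positive feature 0 is left untouched.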

4.3. Adding Consequential Perturbations

Figure 2. An example of the impact of adding consequential perturbations to a perturbed MNIST input that failed to evade a CNN model using FGS. We put the new prediction of each version of the input image in parentheses (.).

This second step is only needed in case the adversarial sample has still failed to evade the model after eliminating non-consequential perturbations. In this case, EG-Booster starts searching for unperturbed features that are directed towards the true label $y_{true}$. When it finds such features, it then incrementally adds small perturbations to these features since they are signaled by the ML explainer as crucial features directed to a correct classification (lines 28–29). The function get_delta(x) carefully chooses a random feature perturbation that satisfies the feature constraint(s) of the dataset at hand (line 28). For instance, in image classification, if the feature representation of the samples is in gray-scale, the pixel perturbation should be in the allowable range of feature values (e.g., $[0, 1]$ for normalized MNIST features). In case a feature perturbation breaks the upper bound constraint $||\delta||_p < \epsilon$, EG-Booster iteratively keeps reducing the perturbation until the bound constraint is satisfied or the number of iterations hits its upper bound (lines 30–37).

In case all features directed to the correct label are already perturbed, EG-Booster proceeds by attempting to increase the perturbations initially performed by the baseline attack on positive features. It does so while respecting the upper bound constraint ($||\delta||_p < \epsilon$). This process terminates immediately once the evasion goal is achieved.

Figure 2 shows an MNIST-CNN example of the impact of enhancing the FGS attack (FGSM) with EG-Booster. In this example, FGS perturbations originally failed to fool the CNN model, which still predicts the correct label ‘7’. The right hand side image shows the perturbations (circled in green) added by EG-Booster that are consequential to fool the CNN model into predicting the input image as ‘9’ instead of ‘7’.
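The step of adding consequential perturbations can be sketched in Python as below. It is a simplified reading of the description in this subsection, with several assumptions: the perturbation source is passed in as a `get_delta(x, i)` callable (the paper's get_delta(x) is random), a perturbation that breaks the bound is halved up to `max_iter` times, features are kept in $[0, 1]$, and the fallback of enlarging already-perturbed positive features is left out.

```python
import math


def lp_norm(delta, p):
    if p == math.inf:
        return max(abs(v) for v in delta)
    return sum(abs(v) ** p for v in delta) ** (1.0 / p)


def add_consequential(predict, x, x_boost, weights, y_true, eps,
                      get_delta, p=math.inf, max_iter=10):
    """Perturb unperturbed positive features until evasion or the budget is used."""
    x_boost = list(x_boost)
    added = 0

    def perturbation_size():
        return lp_norm([b - a for a, b in zip(x, x_boost)], p)

    for i, w in enumerate(weights):
        if predict(x_boost) != y_true:
            break  # evasion achieved: stop immediately
        if w <= 0 or x_boost[i] != x[i]:
            continue  # only unperturbed features with a positive weight
        delta_i = get_delta(x, i)
        x_boost[i] = min(1.0, max(0.0, x[i] + delta_i))
        iterations = 0
        while perturbation_size() >= eps and iterations < max_iter:
            delta_i /= 2.0  # shrink the perturbation to respect the bound
            x_boost[i] = min(1.0, max(0.0, x[i] + delta_i))
            iterations += 1
        if perturbation_size() >= eps:
            x_boost[i] = x[i]  # bound still violated: drop this perturbation
        else:
            added += 1
    return x_boost, added


# Hypothetical classifier and a deterministic stand-in for get_delta.
def toy_classifier(v):
    return int(v[0] + v[1] > 0.9)


x = [0.6, 0.6, 0.3]
weights = [0.5, 0.4, -0.1]
x_boost, added = add_consequential(
    toy_classifier, x, x, weights, y_true=1, eps=0.25,
    get_delta=lambda sample, i: -0.22,
)
```

In this toy run the first positive feature alone is not enough to flip the prediction, so the second one is perturbed as well; the negative feature is never touched.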

4.4. Stability Analysis

As stated earlier (Section 2.1), when evaluated on security-sensitive learning tasks (e.g., malware classifiers), ML explanation methods have been found to exhibit output instability whereby explanation weights of the same input sample can differ from one run to another (explainEval; explainEval2020). In order to ensure that EG-Booster does not inherit the instability limitation of the ML explainer, we perform a stability analysis where we compare the similarity of the prediction results returned by the target model after performing an explanation-guided attack over multiple runs. More precisely, we define the k-stability metric that quantifies the stability of EG-Booster across k distinct runs. It measures the average pairwise (i.e., over each pair of runs) similarity between the returned predictions of the same adversarial sample after the EG-Booster attack. The similarity between two runs $i$ and $j$, $\mathrm{similarity}(i, j)$, is the intersection size of the predictions returned by the two runs over all samples in the test set (i.e., the number of matching predictions). Formally, we define the k-stability as follows.

$$k\text{-stability} = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} \mathrm{similarity}(i, j)$$
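A small sketch of this metric, under the reading that it averages the number of matching predictions over all pairs of runs, is given below. The function names and the toy prediction lists are hypothetical, and no normalization by the test-set size is applied.

```python
from itertools import combinations


def similarity(preds_a, preds_b):
    """Number of samples on which two runs return the same prediction."""
    return sum(a == b for a, b in zip(preds_a, preds_b))


def k_stability(runs):
    """Average pairwise similarity over k runs (k >= 2)."""
    pairs = list(combinations(runs, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical predictions of the target model on 4 samples over 3 runs.
runs = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 0, 4],
]
score = k_stability(runs)
```

The three pairs match on 3, 3, and 2 samples, so the average is 8/3. Dividing by the number of samples would express the same quantity as a fraction.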


The instability of explanation methods is mainly observed in minor differences between the magnitudes of the returned explanation weights of each feature, which might sometimes lead to a different feature ranking. However, the direction of a feature $x_i$ is less likely to change from one run to another, as it is uniquely decided by the sign of its explanation weight $w_i$. This is mainly visible on images, as it is possible to plot feature directions with different colors on top of the original image. EG-Booster only relies on the sign of the explanation weights in the process of detecting non-consequential perturbations and candidates for consequential perturbations. As a result, the output of EG-Booster is expected to be more stable than the output of the underlying explanation method.

To validate our intuition, we compare the stability of EG-Booster with the stability of the employed explanation method. To this end, we compute the (k,l)-Stability metric proposed by prior work (explainEval2020) that evaluates the stability of ML explanation methods for malware classification models. The (k,l)-Stability measure computes the average explanation similarity based on the intersection of features that are ranked in the top-l, returned by separate runs of an explanation method. More precisely, given two explanation results, $W_i$ and $W_j$, returned by two different runs $i$ and $j$, and a parameter $l$, their similarity is obtained based on the Dice coefficient:

$$\mathrm{sim}(W_i, W_j) = \frac{2 \, | top_l(W_i) \cap top_l(W_j) |}{| top_l(W_i) | + | top_l(W_j) |}$$

where $top_l(W_i)$ denotes the top-$l$ features of sample $x$ with respect to the prediction $f(x)$ returned by run $i$. Over a total of $k$ runs, the average (k,l)-Stability on a sample $x$ is defined as follows:

$$(k,l)\text{-Stability} = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} \mathrm{sim}(W_i, W_j)$$
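The Dice-based measure can be sketched in the same style. The sketch assumes that the top-l features are those with the largest absolute explanation weight, which is one plausible ranking and is not stated in the text; the function names and example weights are hypothetical.

```python
from itertools import combinations


def top_l(weights, l):
    """Indices of the l features with the largest absolute weight (assumed ranking)."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return set(ranked[:l])


def dice(a, b):
    """Dice coefficient between two sets."""
    return 2.0 * len(a & b) / (len(a) + len(b))


def kl_stability(explanations, l):
    """Average pairwise Dice similarity of top-l feature sets over k runs."""
    tops = [top_l(w, l) for w in explanations]
    pairs = list(combinations(tops, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Hypothetical explanation weights of one sample returned by two runs.
explanations = [
    [0.9, 0.1, 0.5, 0.0],
    [0.8, 0.6, 0.1, 0.0],
]
score = kl_stability(explanations, l=2)
```

The two runs agree on one of their two top-ranked features, which gives a Dice similarity of 0.5.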


More empirical discussions about the stability results over all samples in a test set are presented in Section 5.4.

5. Experimental Evaluation

In this section, we report our findings through a systematic evaluation of EG-Booster. We first describe our experimental setup in Section 5.1. In Section 5.2, we present details of the evasion effectiveness of EG-Booster. In Sections 5.3, 5.4, and 5.5, we present our findings on perturbation change, stability analysis, and execution time of EG-Booster, respectively.

5.1. Experimental Setup

Datasets. We evaluate EG-Booster on two standard datasets used for adversarial robustness evaluation of deep neural networks: MNIST (MNIST), a handwritten digit recognition task with ten class labels ($0$ to $9$), and CIFAR10 (cifar), an image classification task with ten classes. We use the 10K test samples of both datasets to evaluate the performance of EG-Booster on undefended neural networks (described next). To evaluate EG-Booster against a defended model, we use a subset of the CIFAR10 test set.

Models. For MNIST, we train a state-of-the-art 7-layer CNN model from the “Model Zoo” provided by the SecML library (melis2019secml). It is composed of 3-conv layers+ReLU, 1-Flatten layer, 1-fully-connected layer+ReLU, a dropout-layer (p=0.5), and finally a Flatten-layer. We call this model MNIST-CNN. Its test accuracy is measured over all 10K test samples.

For CIFAR10, we consider two different models: a CNN model (CIFAR10-CNN) with 4-conv2D and a benchmark adversarially-trained ResNet50 model. In our experimental results, we call the first undefended model CIFAR10-CNN and the second defended model CIFAR10-ResNet. More precisely, the structure of CIFAR10-CNN is: 2-conv2D+ReLU, 1-MaxPool2D, 1-dropout(p=0.25), 2-conv2D+ReLU, 1-MaxPool2D, 1-dropout(p=0.25), 2-fully-connected+ReLU, 1-dropout(p=0.25), and 1-fully-connected. To avoid the reduction of height and width of the images, padding is used in every convolutional layer. This model reaches a test accuracy of after epochs. It outperforms the benchmark model adopted by Carlini & Wagner (2017) (CW) and Papernot et al. (2016) (distillation).
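To illustrate why padding in every convolutional layer preserves the height and width of the images, the following sketch traces the spatial size of a 32x32 CIFAR10 input through the convolution and pooling stages listed above, using the standard output-size formula floor((n + 2p - k) / s) + 1. The kernel size of 3 and the pooling size of 2 are assumptions for illustration; the text does not state them.

```python
from typing import List, Tuple


def out_size(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Standard convolution/pooling output size: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(size: int, layers: List[Tuple[str, int]], padded: bool = True) -> List[int]:
    """Return the spatial size after each layer, starting with the input size."""
    sizes = [size]
    for kind, kernel in layers:
        if kind == "conv":
            # "Same" padding for an odd kernel is (k - 1) / 2; 0 means no padding.
            padding = (kernel - 1) // 2 if padded else 0
            size = out_size(size, kernel, padding)
        elif kind == "pool":
            size = out_size(size, kernel, padding=0, stride=kernel)
        sizes.append(size)
    return sizes


# Conv/pool stages of CIFAR10-CNN; kernel and pool sizes are assumed values.
LAYERS = [("conv", 3), ("conv", 3), ("pool", 2), ("conv", 3), ("conv", 3), ("pool", 2)]

print(trace(32, LAYERS, padded=True))
print(trace(32, LAYERS, padded=False))
```

Under these assumptions, only the pooling layers reduce the spatial size when padding is used, whereas each unpadded convolution would shrink it by two pixels.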

CIFAR10-ResNet is a state-of-the-art adversarially-trained network model from the “robustness” library (robustness) proposed by MadryLab. It is trained on images generated with PGD attack using norm and as perturbation bound. It reaches a test accuracy of over all K test set of CIFAR10.

Baseline Attacks. We evaluate EG-Booster in a white-box setting using 4 state-of-the-art white-box baseline attacks i.e., FGS (FGSM), BIM (BIM), PGD (PGSM), and C&W (CW). Furthermore, we validate EG-Booster in a black-box setting using 3 state-of-the-art black-box attacks: MIM (MIM), HSJA (HSJA20), and SPSA (uesato2018adversarial). Detailed explanations of all the 7 baseline attacks are provided in Section 3. We provide the hyper-parameter choices of each attack in the Appendix (Table 4 in Section 7.1).

ML Explainer. The performance of EG-Booster is influenced by the effectiveness of the employed ML explanation method in detecting the direction of each feature in a sample. It is, therefore, crucial to choose an ML explainer suited to the studied systems and the deployed models. In our experiments, we focus on image datasets and neural network models; therefore, we pick SHAP (SHAP), as it has been shown to be effective in explaining deep neural networks, especially in the image classification domain. Furthermore, the SHAP authors proposed an ML explainer called “Deep Explainer”, designed for deep learning models, specifically for image classification. SHAP has no access to the target model, which makes EG-Booster suitable for either a black-box or a white-box threat model. We note that independent recent studies (explainEval; explainEval2020) evaluated ML explanation methods for malware classifiers. They revealed that LIME outperforms other approaches in security systems in terms of accuracy and stability. Thus, we recommend using LIME for future deployment of EG-Booster for robustness evaluation of ML malware detectors.

Evaluation metrics. We use the following four metrics to evaluate EG-Booster:

Evasion Rate: First, we quantify the effectiveness of EG-Booster by monitoring changes to the percentage of successful evasions with respect to the total test set, from the baseline attacks to EG-Booster attacks.

Average Perturbation Change: Second, we keep track of the average change in the number of perturbed features when using the baseline attack versus the EG-Booster attack, to show the impact of the addition or elimination of perturbations performed by EG-Booster. Referring to Algorithm 1 (lines 23 and 42), EG-Booster keeps track of the number of added perturbations and the number of eliminated perturbations per image. Thus, the total perturbation change per image is the number of added perturbations minus the number of eliminated ones, and the average perturbation change across all images of the test set is:


where the change on each example is taken relative to the total number of perturbations initially performed by the baseline attack on that example. When the Average Perturbation Change is positive, EG-Booster is, on average, performing more perturbation additions than eliminations. Otherwise, it is performing more eliminations of non-consequential perturbations. The average becomes zero when there is an equal number of added and eliminated perturbations.

k-Stability & (k,l)-Stability: To evaluate the reliability of EG-Booster, we use the k-Stability (Equation 10) and (k,l)-Stability (Equation 12) measures we introduced in Section 4.4. We compute the k-Stability of EG-Booster for different k values and we compare it to the (k,l)-Stability of the employed explanation method (i.e., SHAP), using the top-10 features and different values of k. Both metrics are calculated in average over samples.

Execution Time: Across both datasets and all models, we measure the per-image execution time (in seconds) taken by EG-Booster for different baseline attacks.
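The Evasion Rate and Average Perturbation Change metrics listed above can be sketched in a few lines of Python. This is an illustrative reading, not the paper's implementation: the function and argument names are hypothetical, and the per-image normalization by the baseline perturbation count is an assumption consistent with the percentages reported in the result tables.

```python
from typing import Sequence


def evasion_rate(predictions: Sequence[int], true_labels: Sequence[int]) -> float:
    """Percentage of test samples whose prediction differs from the true label."""
    evaded = sum(p != t for p, t in zip(predictions, true_labels))
    return 100.0 * evaded / len(true_labels)


def average_perturbation_change(
    added: Sequence[int], eliminated: Sequence[int], baseline: Sequence[int]
) -> float:
    """Mean of (added - eliminated) / baseline over images, as a percentage.

    Images with no baseline perturbation are skipped (assumed convention).
    """
    changes = [(a - e) / n for a, e, n in zip(added, eliminated, baseline) if n > 0]
    return 100.0 * sum(changes) / len(changes)


# Hypothetical values for a test set of four images and two attacked images.
print(evasion_rate([1, 0, 2, 2], [1, 1, 2, 0]))
print(average_perturbation_change(added=[2, 0], eliminated=[1, 5], baseline=[10, 10]))
```

A negative result indicates that, on average, more baseline perturbations were eliminated than new ones were added.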

Undefended MNIST-CNN
Baseline Attacks (White-Box, Black-Box) and Norms: FGS (), BIM (), PGD (), C&W (), MIM ()
Initial Evasion Rate: 28.20%, 49.92%, 99.80%, 55.09%, 99.14%, 41.34%
EG-Booster Evasion Rate: 89.71%, 89.08%, 99.89%, 88.69%, 99.72%, 79.08%
Average Perturbation Change: +14.44%, +14.47%, -5.46%, -4.39%, -15.36%, -5.49%

Table 1. Summary of EG-Booster results on undefended MNIST-CNN Model ().

Undefended CIFAR10-CNN
Baseline Attacks (White-Box, Black-Box) and Norms: FGS (), PGD (), C&W (), HSJA ()
Initial Evasion Rate: 22.36%, 83.04%, 22.46%, 91.07%, 99.22%, 14.04%
EG-Booster Evasion Rate: 30.14%, 87.45%, 31.12%, 92.34%, 99.22%, 28.87%
Average Perturbation Change: -48.52%, -49.62%, -48.25%, -51.15%, -52.16%, -36.18%

Table 2. Summary of EG-Booster results on undefended CIFAR10-CNN Model ().

Defended CIFAR10-ResNet
Baseline Attacks (White-Box, Black-Box) and Norms: FGS (), PGD (), C&W (), SPSA ()
Initial Evasion Rate: 9.75%, 73.27%, 9.66%, 86.25%, 99.00%, 10.80%
EG-Booster Evasion Rate: 18.05%, 74.73%, 18.37%, 86.76%*, 99.75%, 23.46%
Average Perturbation Change: -41.23%, -45.04%, -41.16%, -46.51%, -51.20%, -42.11%

Table 3. Summary of EG-Booster results on defended (adversarially-trained) CIFAR10-ResNet50 Model ().

5.2. Evasion Results

Tables 1, 2, and 3 report results of EG-Booster on different networks: the undefended MNIST-CNN, the undefended CIFAR10-CNN, and the defended (adversarially-trained) CIFAR10-ResNet, respectively.

Evasion rate in a nutshell: Across the three tables, we observe a significant increase in the evasion rate of the studied baseline attacks after combining them with EG-Booster. For instance, Table 1 shows an average increase of of the evasion rate across all attacks performed on the undefended MNIST-CNN model. Similarly, the evasion rate results in Table 2 convey that, on average across all studied baseline attacks, more adversarial images are produced by EG-Booster to evade the undefended CIFAR10-CNN model. The same observations can be drawn from Table 3. More precisely, over all baseline attacks performed on the adversarially-trained CIFAR10-ResNet model, we observe an average of increase in the evasion rate when combined with EG-Booster. In a nutshell, our findings consistently suggest that explanation methods can be employed to effectively guide evasion attacks towards higher evasion accuracy.
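As a worked example of the averages discussed above, the following sketch computes the unweighted mean increase in evasion rate, in percentage points, directly from the rows of Tables 1, 2, and 3. The choice of an unweighted mean over the six columns of each table is an assumption; the averages reported by the authors may have been computed differently.

```python
from typing import Dict, List, Tuple

# (initial evasion rates, EG-Booster evasion rates) in %, copied from Tables 1-3.
TABLES: Dict[str, Tuple[List[float], List[float]]] = {
    "MNIST-CNN": (
        [28.20, 49.92, 99.80, 55.09, 99.14, 41.34],
        [89.71, 89.08, 99.89, 88.69, 99.72, 79.08],
    ),
    "CIFAR10-CNN": (
        [22.36, 83.04, 22.46, 91.07, 99.22, 14.04],
        [30.14, 87.45, 31.12, 92.34, 99.22, 28.87],
    ),
    "CIFAR10-ResNet": (
        [9.75, 73.27, 9.66, 86.25, 99.00, 10.80],
        [18.05, 74.73, 18.37, 86.76, 99.75, 23.46],
    ),
}


def mean_increase(initial: List[float], boosted: List[float]) -> float:
    """Unweighted mean of the per-attack increase, in percentage points."""
    return sum(b - a for a, b in zip(initial, boosted)) / len(initial)


for model, (initial, boosted) in TABLES.items():
    print(f"{model}: {mean_increase(initial, boosted):.2f} percentage points")
```

Under this reading, the mean gain on MNIST-CNN is several times larger than the gains on the two CIFAR10 models.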

Figure 3. The evasion rate curves of baseline attacks against EG-Booster attacks across different perturbation bounds, using for MNIST and for CIFAR10 as distance metrics.

EG-Booster is consistently effective across model architectures, threat models, and distance metrics: Our results consistently suggest that EG-Booster is agnostic to model architecture and threat model, and supports diverse distance metrics. For instance, the increase in evasion rate is observed for white-box baseline attacks (i.e., FGS, BIM, PGD, and C&W) as well as for black-box attacks (i.e., MIM, HSJA, and SPSA). Additionally, the improvement in baseline attack performance is observed for different norms. For the same perturbation bound value and the same attack strategy (e.g., FGS, PGD), we notice an increase in the evasion rate regardless of the employed distance metric.

EG-Booster is consistently effective over a range of perturbation bounds: Our experiments additionally cover the assessment of EG-Booster for different perturbation bound values. Figure 3 reports the evasion rate curves of the state-of-the-art attacks (e.g., PGD, MIM, etc.) before and after being guided by EG-Booster (see EG-PGD, EG-MIM, etc. in Figure 3), using different perturbation bounds. For all target models trained on MNIST and CIFAR10, we observe a significant improvement in the evasion rate of all baseline attacks regardless of the perturbation bound. However, for most attacks, we observe that a higher perturbation bound, which allows a greater perturbation size, can lead to a higher evasion rate.

It is noteworthy that the increase in evasion rate is considerably higher for baseline attacks that initially have a low evasion rate. For instance, FGS and BIM, which initially produced, respectively, and evasive adversarial samples from the total of 10K MNIST samples, have improved at least twofold after employing EG-Booster. Similar observations can be drawn from CIFAR10-CNN and CIFAR10-ResNet. Particularly, for CIFAR10-CNN, the increase rate of evasion results on FGS () is higher than the increase rate of FGS (). However, even though EG-Booster results in a higher increase rate in evasive samples for less performant attacks, we note that the total evasion rate of EG-Booster combined with stronger attacks is higher. This is expected, as EG-Booster builds on the initial results of baseline attacks, which results in a correlation between the initial evasion rate of a baseline attack and the post-EG-Booster evasion rate. This is mainly observed for the C&W attack: across all models, EG-Booster combined with C&W exhibits the highest evasion rates (i.e., ). Such correlation is also evident in the evasion rate curves from Figure 3.

EG-Booster is still effective even against defended models: In Table 3, we focus our experiments on a defended model in order to examine the effectiveness of EG-Booster against the adversarial training defence. Since is used for adversarial training, CIFAR10-ResNet is expected to be specifically robust against gradient-based attacks performed using norm. Indeed, and achieve considerably lower initial evasion rates on CIFAR10-ResNet (), compared to undefended models (). Nevertheless, combined with EG-Booster, and result in more evasive samples (), even against a defended model. The same findings can be observed for other baseline attacks (e.g. , , etc). We conclude that EG-Booster is still effective even against defended models. Given these promising results, we hope that EG-Booster will be considered in the future as a benchmark for robustness assessment of ML models.

5.3. Perturbation Change

In addition to the evasion rate metric, we keep track of the perturbation changes introduced by EG-Booster for each experiment. More precisely, we compute the average perturbation change, including the number of added and eliminated perturbations (Equation 13). Results from the last rows of Tables 1, 2, and 3 show that, for most baseline attacks, EG-Booster makes fewer perturbations while improving the evasion rate. This is reflected by the negative sign of the average perturbation change for most of the studied baseline attacks across the three tables. These findings indicate that, without considering the pre-perturbation feature direction explanations, baseline attacks initially perform a considerable number of non-consequential (unnecessary) perturbations. Consequently, in addition to the improvement of the evasion rate, these results demonstrate the importance of taking into account the feature explanation weights in the formulation of evasion attacks (Formulation 9).

It is noteworthy that the magnitude of the negative average perturbation change is particularly large for baseline attacks that initially have a high evasion rate. This is mainly true for the C&W attack across the 3 tables and in Table 2. We explain this observation by the fact that EG-Booster performs additional perturbations only when the baseline adversarial sample originally fails to evade the model (Algorithm 1: line 27); otherwise, it only eliminates non-consequential perturbations while maintaining the original evasion result.
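The branching described above can be illustrated with a toy sketch. It is not Algorithm 1: the function `refine`, its arguments, the order of the two steps, and the linear toy classifier are all hypothetical, and the sketch only shows that additions happen on evasion failure while eliminations must preserve the model's decision.

```python
from typing import Callable, List, Sequence, Tuple

Predict = Callable[[Sequence[float]], int]


def refine(
    x: Sequence[float],          # original sample
    x_adv: Sequence[float],      # baseline adversarial sample
    deltas: Sequence[float],     # candidate per-feature perturbations (assumed given)
    directions: Sequence[int],   # +1: feature directed to the true label, -1: otherwise
    true_label: int,
    predict: Predict,
) -> Tuple[List[float], int, int]:
    x_new = list(x_adv)
    added = eliminated = 0

    # Elimination: undo perturbations on negative features when the
    # model's decision stays the same without them.
    for i, d in enumerate(directions):
        if d < 0 and x_new[i] != x[i]:
            label_before = predict(x_new)
            perturbed_value = x_new[i]
            x_new[i] = x[i]
            if predict(x_new) == label_before:
                eliminated += 1
            else:
                x_new[i] = perturbed_value  # restore: the perturbation mattered

    # Addition: only when the sample still fails to evade the model.
    if predict(x_new) == true_label:
        for i, d in enumerate(directions):
            if d > 0 and x_new[i] == x[i]:
                x_new[i] = x[i] + deltas[i]
                added += 1
                if predict(x_new) != true_label:
                    break
    return x_new, added, eliminated


# Toy linear classifier (hypothetical): label 1 if v0 + v1 - v2 > 0, else 0.
def toy_predict(v: Sequence[float]) -> int:
    return 1 if v[0] + v[1] - v[2] > 0 else 0


x = [1.0, 1.0, 0.5]
result = refine(x, [1.0, 1.0, 0.8], [-0.9, -0.9, 0.3], [1, 1, -1], 1, toy_predict)
print(result)
```

When the baseline sample already evades the toy model, the same function only eliminates perturbations and adds none, which is why attacks with a high initial evasion rate tend to show a negative perturbation change.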

In some of the experiments, we notice a positive value of the average perturbation change. This is particularly the case for and in Table 1. In this case, EG-Booster performs more additional perturbations than eliminations, which reflects the drastic improvement of the evasion rate for these two attacks. These findings particularly demonstrate the direct impact of perturbing features that are directed to the true label in preventing the model from making a correct prediction.

Focusing on Tables 1 and 2, we notice that, on average, the increase rate in evasion results for MNIST () is higher than for CIFAR10 () using a CNN model for both datasets. Additionally, the average perturbation change for all experiments in Table 2 (i.e., CIFAR10-CNN) is negative and has a higher magnitude than its counterparts in Table 1 (MNIST-CNN). Further investigations have shown that these two observations are related. In particular, we found that baseline attacks performed on CIFAR10-CNN are already perturbing a considerable number of features that are directed to the true label, which makes the rate of perturbations added by EG-Booster on CIFAR10-CNN lower than on MNIST-CNN. This observation might explain the difference in evasion increase between both datasets. However, as discussed in Section 4, EG-Booster specifically examines this case, i.e., the case where a significant number of detected consequential features turn out to be already perturbed. More precisely, in case of initial evasion failure, EG-Booster proceeds by additionally perturbing features that are already perturbed by the baseline attack while ensuring that the perturbation size is still within the bound ().

Figure 4. Comparison between the stability of EG-Booster and the employed explanation method, SHAP, across models and datasets, for different k values. The (k,l)-Stability of SHAP is computed on the top-10 features (l = 10). EG-Booster is performed using different baseline attacks subject to . runs are performed on test samples for each attack.

5.4. Stability Analysis

As highlighted by prior works (explainEval; explainEval2020) and thoroughly discussed in this paper, due to their stability concerns, ML explanation methods should be carefully introduced when adopted in security-critical settings. Thus, we evaluate the reliability of explanation methods to be employed for the robustness assessment of ML systems by conducting a comparative analysis between the stability of the employed ML explainer (i.e., SHAP) and the stability of our explanation-based method. In particular, we use the (k,l)-Stability and k-Stability metrics defined in Section 4.4 to compute, respectively, the output stability of SHAP and of EG-Booster combined with baseline attacks, across the different studied models.

EG-Booster doesn’t inherit instabilities of SHAP: In Figure 4, we plot our stability analysis results. Focusing on the stability curves of EG-Booster, we deduce that it is almost stable for all studied models and across different baseline attacks. Consequently, the classification results returned by the target ML model are nearly identical across different runs of EG-Booster. Such findings support the reliability of our approach, which can be confidently adopted for improved robustness assessment of ML models against baseline attacks. Compared to the stability curves of SHAP, we observe that EG-Booster does not inherit the instability concern indicated by the ML explainer’s output. As discussed in Section 4.4, these findings are explained by the reliance only on the sign of explanation weights to decide the feature directions (i.e., positive and negative features). Since the distortions across different runs of the feature direction results returned by an accurate ML explainer are minor, the feature selection for perturbation performed by EG-Booster returns the same feature sets across different runs, which overall leads to the same evasion results.
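The argument that the sign of explanation weights is more stable than their ranking can be illustrated with a small sketch. The two weight vectors below are hypothetical values standing in for two runs of an explainer on the same sample; they are not measurements from the experiments.

```python
from typing import List, Set


def directions(weights: List[float]) -> List[int]:
    """Feature directions from the sign of each explanation weight."""
    return [1 if w > 0 else -1 if w < 0 else 0 for w in weights]


def top_l_indices(weights: List[float], l: int) -> Set[int]:
    """Indices of the l features with the largest absolute weight."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return set(order[:l])


# Hypothetical weights from two runs: magnitudes drift slightly, signs do not.
run_a = [0.40, 0.38, -0.10, 0.05, -0.30]
run_b = [0.35, 0.41, -0.12, 0.02, -0.37]

print(directions(run_a) == directions(run_b))              # same feature directions
print(top_l_indices(run_a, 2) == top_l_indices(run_b, 2))  # top-2 sets differ
```

In this constructed example, a ranking-based comparison of the two runs registers a disagreement, while a method that only consumes the signs sees identical inputs.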

Although EG-Booster is not deeply influenced by the stability of the employed ML explainer, its success nevertheless relies on the explainer’s accuracy in producing precise explanations, specifically precise feature directions in our case. Given the promising results that our approach showed when relying on SHAP, we hope that future improvements in ML explanation methods will lead to even better performance of EG-Booster.

5.5. Execution Time

The measurements we present here are based on a hardware environment with 6 vCPUs, 18.5 GB memory, and 1 GPU-NVIDIA Tesla K80.

In Figure 5, we plot the range of execution times of EG-Booster recorded by different experiments. EG-Booster takes, on average , and per-sample, respectively, for MNIST-CNN, CIFAR10-CNN, and CIFAR10-ResNet (orange line). Overall, these observations reflect the efficiency of EG-Booster; however, they additionally reveal the impact of the input dimension and the model architecture on the running time of EG-Booster. For instance, it takes more time on CIFAR10 compared to MNIST, which is due to the difference in input dimension between both datasets, i.e., (28,28) for MNIST and (3,32,32) for CIFAR10. Additionally, for the same dataset (i.e., CIFAR10), we observe a considerable difference in execution time of EG-Booster between CNN and ResNet50 (i.e., ), which is explained by the higher dimension of ResNet50 compared to CNN. This final result is expected, since EG-Booster is a query-based approach (Algorithm 1: lines 10, 13, 19, and 27). Thus, the duration that a neural network takes to respond to a query influences the total execution time of EG-Booster.
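One way to observe how query latency contributes to the running time of a query-based procedure is to wrap the model's prediction function so that it counts queries and accumulates the time spent in them. The `QueryCounter` class below is a hypothetical helper written for illustration; it is not part of EG-Booster or of any library mentioned in this paper.

```python
import time
from typing import Any, Callable


class QueryCounter:
    """Wrap a prediction function to count queries and time spent answering them."""

    def __init__(self, predict: Callable[[Any], Any]) -> None:
        self._predict = predict
        self.queries = 0
        self.seconds = 0.0

    def __call__(self, x: Any) -> Any:
        start = time.perf_counter()
        try:
            return self._predict(x)
        finally:
            # Record the query even if the wrapped function raises.
            self.queries += 1
            self.seconds += time.perf_counter() - start


# Stand-in for a model query; a real network would be slower on larger inputs.
counted = QueryCounter(lambda x: sum(x) > 0)
for _ in range(5):
    counted([0.1, -0.2, 0.3])

print(counted.queries, counted.seconds)
```

With such a wrapper, the time attributable to the model is roughly the number of queries multiplied by the per-query latency, which is consistent with a deeper network leading to a longer per-sample execution time.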

Figure 5. Execution time of EG-Booster performed on baseline attacks, per-sample, across the studied models.

6. Related Work

In recent years, we have witnessed a flurry of adversarial example crafting methods (Carlini-list) and keeping up-to-date on new research in this sub-domain is a challenge. Nevertheless, in the following we position EG-Booster with respect to closely related work.

Evasion Attacks. Here, we focus on benchmark attacks that EG-Booster builds upon. While ML evasion attacks can be traced back to the early 2000s (Wild-patterns18), recent improvements in the accuracy of deep neural networks on non-trivial learning tasks have brought adversarial manipulations to the fore as potential threats to ML models. Biggio et al. (Biggio-ECML13) introduced gradient-based evasion attacks against PDF malware detectors. Szegedy et al. (szegedy2014intriguing) demonstrated gradient-based attacks against image classifiers. Goodfellow et al. (FGSM) introduced FGSM, a fast one-step adversarial example generation strategy. Kurakin et al. (BIM) later introduced BIM as an iterative improvement of FGSM. Around the same time, Madry et al. (PGSM) proposed the PGD attack. The common thread among this class of attacks and similar gradient-based ones is that they all aim to solve for an adversarial example misclassified with maximum confidence subject to a bounded perturbation size.

Another line of evasion attacks aims to minimize the perturbation distance subject to the evasion goal (e.g., classify a sample to a target label chosen by the adversary). Among such attacks, Carlini and Wagner (CW) introduced the C&W attack, one of the strongest white-box attacks to date. Other attacks such as DeepFool (DeepFool16) and SparseFool (SparseFool19) use the same notion of minimum perturbation distance. Recent attacks introduced by Brendel et al. (Accurate-Fast19) and Croce and Hein (Croce020a) improve on the likes of DeepFool and SparseFool, both in accuracy and in speed of adversarial example generation. EG-Booster builds upon this body of adversarial example crafting attacks. As our evaluations suggest, it improves the evasion rate of major benchmark attacks used by prior work.

Pintor et al. (FMN21) propose the fast minimum-norm (FMN) attack that improves both accuracy and speed of benchmark gradient-based attacks across norms, for both targeted and untargeted white-box attacks. Like FMN, EG-Booster covers MNIST and CIFAR10 datasets, is evaluated on undefended and defended models, considers norm distance metrics, and on aggregate improves evasion accuracy of existing attacks. Unlike FMN, EG-Booster covers black-box and white-box attacks, is not limited to gradient-based attacks, leverages explanations to improve the accuracy of existing attacks instead of proposing a new attack strategy, and is limited to untargeted attacks.

Among black-box evasion attacks, we highlight MIM (MIM), HSJA (HSJA20), and SPSA (uesato2018adversarial). MIM leverages the momentum method (momentum) that stabilizes gradient updates by accumulating a velocity vector in the gradient direction of the loss function across iterations, and allows the algorithm to escape from poor local maxima. HSJA relies on prediction label only to create adversarial examples, with key intuition based on gradient-direction estimation, motivated by zeroth-order optimization. In SPSA, Uesato et al. (uesato2018adversarial) explore adversarial risk as a measure of the model’s performance on worst-case inputs, to define a tractable surrogate objective to the true adversarial risk.

Explanation-Guided Attacks. A recent body of work demonstrates how ML explanation methods can be harnessed to poison training data, extract models, and infer training data-points. Severi et al. (EG-Poisoning) leverage ML explanation methods to guide the selection of features to construct backdoor triggers that poison training data of malware classifiers. Milli et al. (EG-model-extraction19) demonstrate that gradient-based model explanations can be leveraged to approximate the underlying model. In a similar line of work, Aïvodji et al. (EG-model-extraction20) show how counterfactual explanations can be used to extract a model. Very recently, Shokri et al. (EG-mem-inference2021) study the tension between transparency via model explanations and privacy leakage via membership inference, and show that backpropagation-based explanations can leak a significant amount of information about individual training data-points.

Compared to this body of work, EG-Booster is the first approach to leverage ML explanations to significantly improve accuracy of evasion attacks.

Reliability Analysis of Explanation Methods. Focusing on robustness of explanations to adversarial manipulations, Ghorbani et al. (exp-Rob1) and Heo et al. (exp-Rob2) demonstrate that model explanation results are sensitive to small perturbations of explanations that preserve the predicted label. If ML explanations were provided to EG-Booster as in “Explanation-as-a-Service”, an adversary could potentially alter the explanation results, which might in effect influence our explanation-guided attack. In EG-Booster, baseline attacks and explanation methods are locally integrated into the robustness evaluation workflow (hence not disclosed to an adversary).

Recent studies by Warnecke et al. (explainEval) and Fan et al. (explainEval2020) systematically analyze the utility of ML explanation methods especially on models trained for security-critical tasks (e.g., malware classifiers). In addition to general evaluation criteria (e.g., explanation accuracy and sparsity), Warnecke et al. (explainEval) focused on other security-relevant evaluation metrics (e.g., stability, efficiency, and robustness). In a related line of work, Fan et al. (explainEval2020) performed a similar study on Android malware classifiers that led to similar conclusions.

7. Conclusion

In this paper, we introduce EG-Booster, the first explanation-guided booster for ML evasion attacks. Guided by feature-based explanations of model predictions, EG-Booster significantly improves evasion accuracy of state-of-the-art adversarial example crafting attacks by introducing consequential perturbations and eliminating non-consequential ones. By systematically evaluating EG-Booster on MNIST and CIFAR10, across white-box and black-box attacks against undefended and defended models, and using diverse distance metrics, we show that EG-Booster significantly improves evasion accuracy of reference evasion attacks on undefended MNIST-CNN and CIFAR10-CNN models, and an adversarially-trained CIFAR10-ResNet50 model. Furthermore, we empirically show the reliability of explanation methods to be adopted for robustness assessment of ML models. We hope that EG-Booster will be used by future work as a benchmark for robustness evaluation of ML models against adversarial examples.



7.1. Baseline Attacks Hyper-Parameters

Besides the employed distance metrics and perturbation bound values that we specified in Section 5, the parameters used for each baseline attack are specified in Table 4:

Attack Hyper-Parameters
FGS , , which indicate the lower and upper bound of features. These two parameters are specified only for MNIST dataset.

PGD , , with solver parameters: (, , ) .
MIM , , , , ) .
SPSA , , .
HSJA , , , , , .
Table 4. Attack Hyper-Parameters.