1. Introduction
Machine Learning (ML) models are vulnerable to test-time evasion attacks called adversarial examples: adversarially perturbed inputs crafted to mislead a deployed ML model (FGSM). Evasion attacks have been the subject of extensive recent research (FGSM; BIM; PGSM; MIM; CW; HSJA20; uesato2018adversarial) that led to understanding the potential threats these attacks pose when ML models are deployed in security-critical settings such as self-driving vehicles (DLautnonmous17), malware classification (MalConvEvade18), speech recognition (DLSpeech2012), and natural language processing (NLP16). Given a ML model $f$ and an input $x$ with a true label $y$, the goal of a typical evasion attack is to perform minimal perturbations to $x$ and obtain an $x'$ similar to $x$ such that $f$ is fooled to misclassify $x'$ as $y' \neq y$. For a defender whose goal is to conduct pre-deployment robustness assessment of a model against adversarial examples, one needs to adopt a reliable input crafting strategy that can reveal the potential security limitations of the ML system. In this regard, a systematic examination of the suitability of each feature as a potential candidate for adversarial perturbation can be guided by the contribution of each feature to the classification result (amich2021explanationguided). In this context, recent progress in feature-based ML explanation techniques (LIME; SHAP; LEMNA) is interestingly positioned in line with the robustness evaluation goal. In fact, ML explanation methods have recently been utilized to guide robustness evaluation of models against backdoor poisoning (EGPoisoning), model extraction (EGmodelextraction19; EGmodelextraction20), and membership inference attacks (EGmeminference2021), which highlights the utility of ML explanation methods beyond ensuring the transparency of model predictions.
In this paper, we present EG-Booster, an explanation-guided evasion booster that leverages techniques from explainable ML (LIME; SHAP; LEMNA) to guide adversarial example crafting for improved robustness evaluation of ML models before deploying them in security-critical settings. Inspired by a case study in (amich2021explanationguided), our work is the first to leverage black-box model explanations as a guide for systematic robustness evaluation of ML models against adversarial examples. The key insight in EG-Booster is the use of feature-based explanations of model predictions to guide adversarial example crafting. Given a model $f$, a $d$-dimensional input sample $x = (x_1, \ldots, x_d)$, and a true prediction label $y$ such that $f(x) = y$, a ML explanation method returns a weight vector $W = (W_1, \ldots, W_d)$ where each $W_i$ quantifies the contribution of feature $x_i$ to the prediction $f(x) = y$. The sign of $W_i$ represents the direction of feature $x_i$ with respect to $y$: if $W_i > 0$, $x_i$ is directed towards $y$ (in this case we call $x_i$ a positive feature); in the opposite case, i.e., if $W_i < 0$, $x_i$ is directed away from $y$ (in this case we call $x_i$ a negative feature). EG-Booster leverages the signs of individual feature weights in two complementary ways. First, it uses the positive weights to identify positive features that are worth perturbing and introduces consequential perturbations likely to result in an $x'$ that will be misclassified as $y' \neq y$. Second, it uses negative weights to identify negative features that need not be perturbed and eliminates non-consequential perturbations unlikely to contribute to the misclassification goal. EG-Booster is agnostic to model architecture, adversarial knowledge and capabilities (e.g., black-box, white-box), and supports diverse distance metrics (e.g., common $\ell_p$ norms) used previously in the adversarial examples literature.
In an orthogonal line of research that studied explanation stability and adversarial robustness of model explanations, some limitations of explanation methods have been documented (explainEval2020; explainEval). Recognizing these potential limitations, which entail systematic vetting of the reliability of ML explanation methods before using them for robustness assessment of ML models, we introduce an explanation assessment metric called Stability, which measures the average stability of evasion results based on the similarity of the target model's predictions returned by multiple runs of EG-Booster.
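The sign-based feature partition described above can be sketched in a few lines (a minimal NumPy illustration; the function name is ours and not part of EG-Booster's released code):

```python
import numpy as np

def split_features_by_direction(weights):
    """Partition feature indices by the sign of their explanation weights.

    Positive weights mark features directed towards the true label y
    (candidates for consequential perturbation); negative weights mark
    features directed away from y (perturbing them is non-consequential).
    """
    weights = np.asarray(weights)
    positive = np.flatnonzero(weights > 0)  # worth perturbing
    negative = np.flatnonzero(weights < 0)  # perturbations here can be eliminated
    return positive, negative
```

For example, for weights `[0.3, -0.1, 0.0, 0.7]` the positive features are indices 0 and 3 and the only negative feature is index 1; zero-weight features fall in neither group.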
We evaluate EG-Booster through comprehensive experiments on two benchmark datasets (MNIST and CIFAR10), across white-box and black-box attacks, on state-of-the-art undefended and defended models, covering commonly used $\ell_p$ norms. From white-box attacks we use the Fast Gradient Sign method (FGS) (FGSM), the Basic Iterative Method (BIM) (BIM), the Projected Gradient Descent method (PGD) (PGSM), and the Carlini and Wagner attack (C&W) (CW). From black-box attacks, we use the Momentum Iterative Method (MIM) (MIM), the HopSkipJump Attack (HSJA) (HSJA20), and the Simultaneous Perturbation Stochastic Approximation (SPSA) attack (uesato2018adversarial).
Across all studied models, we observe a significant increase in the evasion rate of the studied baseline attacks after combining them with EG-Booster. In particular, results show a considerable average increase in evasion rate across all attacks performed on the undefended MNIST-CNN model. Similar findings are observed on the undefended CIFAR10-CNN model and on a defended CIFAR10-ResNet model. In addition to the reduction of the total number of perturbations compared to the baseline attacks, these findings demonstrate that ML explanation methods can be harnessed to guide ML evasion attacks towards more consequential perturbations. Furthermore, the stability analysis results show that EG-Booster's outputs are stable across different runs. Such findings highlight the reliability of explanation methods to boost evasion attacks, despite their stability concerns. More detailed results are discussed in Section 5.
In summary, this paper makes the following contributions:

Explanation-Guided Evasion Booster. We introduce the first approach that leverages ML explanation methods towards evaluating the robustness of ML models to adversarial examples.

Stability Analysis. We introduce a novel stability metric that enables vetting the reliability of ML explanation methods before they are used to guide ML robustness evaluation.

Comprehensive Evaluation. We conduct comprehensive evaluations on two benchmark datasets, four white-box attacks, three black-box attacks, and different distance metrics, on undefended and defended state-of-the-art target models.
To enable reproducibility, we have made our source code available with directions to reproduce our experiments. EG-Booster code is available at: https://github.com/EGBooster/code.
2. Background
In this section, we introduce ML explanation methods and ML evasion attacks.
2.1. ML Explanation Methods
Humans typically justify their decisions by explaining the underlying causes used to reach them. For instance, in an image classification task (e.g., cats vs. dogs), humans attribute their classification decision (e.g., cat) to certain parts/features (e.g., pointy ears, longer tail) of the image they see, and not all features have the same importance/weight in the decision process. ML models had long been perceived as black-boxes in their predictions until the advent of explainable ML (LIME; DeepLIFT; SHAP), which attributes a decision of a model to the features that contributed to it. This notion of attribution is based on the quantifiable contribution of each feature to a model's decision. Intuitively, an explanation is defined as follows: given an input vector $x = (x_1, \ldots, x_d)$, a model $f$, and a prediction label $y = f(x)$, an explanation method determines why input $x$ has been assigned the label $y$. This explanation is typically represented as a weight vector $W = (W_1, \ldots, W_d)$ that captures the contribution of each feature $x_i$ towards $y$.
ML explanation is usually accomplished by training a substitute model based on the input feature vectors and output predictions of the original model, and then using the coefficients of that substitute model to approximate the importance and direction (the class label it leans towards) of a feature. A typical substitute model for explanation is of the linear form $g(x) = \sum_{i=1}^{d} W_i x_i$, where $d$ is the number of features, $x$ is the sample, $x_i$ is the $i$-th feature of sample $x$, and $W_i$ is the weight of feature $x_i$ in the model's decision. While ML explanation methods exist for white-box (Whiteboxexp13; Whiteboxexp14) or black-box (LIME; LEMNA; SHAP) access to the model, in this work we consider ML explanation methods that have black-box access to the ML model, among which the notable ones are LIME (LIME), SHAP (SHAP), and LEMNA (LEMNA). Next, we briefly introduce these explanation methods.
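To make the substitute-model idea concrete, the following is a simplified, LIME-style local surrogate written from scratch. This is an illustrative sketch only, not the implementation of LIME, SHAP, or LEMNA; the masking scheme and kernel width are arbitrary choices of ours:

```python
import numpy as np

def local_surrogate_weights(predict_fn, x, n_samples=500, seed=0):
    """LIME-style local surrogate sketch: sample neighbours of x by
    randomly zeroing features, query the black-box model, and fit a
    proximity-weighted linear model; its coefficients approximate the
    per-feature explanation weights W."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = rng.integers(0, 2, size=(n_samples, d)) * x         # masked neighbours
    preds = np.array([predict_fn(z) for z in Z])
    dist = np.linalg.norm(Z - x, axis=1)
    pi = np.exp(-dist**2 / (2 * (0.75 * np.sqrt(d))**2))    # proximity kernel
    # Weighted least squares: scale rows by sqrt(pi), include an intercept.
    A = np.hstack([Z, np.ones((n_samples, 1))]) * np.sqrt(pi)[:, None]
    b = preds * np.sqrt(pi)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:-1]                                        # drop the intercept
```

The recovered coefficients play the role of the weight vector $W$: their signs give each feature's direction with respect to the predicted label.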
LIME and SHAP. Ribeiro et al. (LIME) introduce LIME as one of the first model-agnostic black-box methods for locally explaining model output. Lundberg and Lee further extended LIME by proposing SHAP (SHAP). Both methods approximate the decision function $f$ by creating a series of perturbations of a sample $x$, denoted $x'_1, \ldots, x'_n$, obtained by randomly setting feature values in the vector $x$ to $0$. The methods then proceed by predicting a label $f(x'_j)$ for each of the perturbations. This sampling strategy enables the methods to approximate the local neighborhood of $f$ at the point $x$. LIME approximates the decision boundary by a weighted linear regression model using Equation 1:

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \sum_{j=1}^{n} \pi_x(x'_j)\,\big(f(x'_j) - g(x'_j)\big)^2 \qquad (1)$$

In Equation 1, $G$ is the set of all linear functions and $\pi_x$ is a function indicating the difference between the input $x$ and a perturbation $x'_j$. SHAP follows a similar approach but employs the SHAP kernel as the weighting function $\pi_x$, which is computed using Shapley values (shapley) when solving the regression. Shapley values are a concept from game theory where the features act as players under the objective of finding a fair contribution of the features to the payout, in this case the prediction of the model.
LEMNA. Another black-box explanation method, specifically designed to be a better fit for non-linear models, is LEMNA (LEMNA). As shown in Equation 2, it uses a mixture regression model for approximation, that is, a weighted sum of $K$ linear models:

$$f(x) \approx \sum_{j=1}^{K} \pi_j\,(\beta_j \cdot x + \epsilon_j) \qquad (2)$$

In Equation 2, the parameter $K$ specifies the number of models, the random variables $\epsilon = (\epsilon_1, \ldots, \epsilon_K)$ originate from a normal distribution $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$, and $\pi = (\pi_1, \ldots, \pi_K)$ holds the weights for each model. The variables $\beta_1, \ldots, \beta_K$ are the regression coefficients and can be interpreted as $K$ linear approximations of the decision boundary near $f(x)$.
Evaluation of ML explanation methods. Recent studies have focused on evaluating ML explainers for security applications (explainEval; explainEval2020). The authors proposed security-related evaluation criteria such as accuracy, stability, and robustness. Results have shown that learning-based approaches such as LIME, SHAP, and LEMNA suffer from unstable outputs; in particular, their explanations can slightly differ between two distinct runs. In this work, we take measures to ensure that EG-Booster does not inherit the potential stability limitation of the employed ML explanation method (more in Section 4.4). Furthermore, recent studies (expRob1; expRob2) have demonstrated that explanation results are sensitive to small systematic feature perturbations that preserve the predicted label. Such attacks can potentially alter the explanation results, which raises concerns about the robustness of ML explanation methods. In EG-Booster, baseline attacks and explanation methods are locally integrated into the robustness evaluation workflow (hence not disclosed to an adversary).
2.2. ML Evasion Attacks
Given a ML model (e.g., image classifier, malware classifier) with a decision function $f$ that maps an input sample $x$ to a true class label $y$, $x' = x + \delta$ is called an adversarial sample with an adversarial perturbation $\delta$ if $f(x') = y' \neq y$ and $\|\delta\| < \epsilon$, where $\|\cdot\|$ is a distance metric (e.g., one of the $\ell_p$ norms) and $\epsilon$ is the maximum allowable perturbation that results in misclassification while preserving the semantic integrity of $x$. Semantic integrity is domain- and/or task-specific. For instance, in image classification, visual imperceptibility of $x'$ from $x$ is desired, while in malware detection $x$ and $x'$ need to satisfy certain functional equivalence (e.g., if $x$ was a malware pre-perturbation, $x'$ is expected to exhibit maliciousness post-perturbation as well). In untargeted evasion, the goal is to make the model misclassify a sample to any different class (e.g., for a roadside sign detection model: misclassify a red light as any other sign). When the evasion is targeted, the goal is to make the model misclassify a sample to a specific target class (e.g., in malware detection: misclassify malware as benign).
Evasion attacks can be done in a white-box or black-box setting. Most gradient-based evasion techniques (FGSM; BIM; PGSM; CW) are white-box because the adversary typically has access to the model architecture and parameters/weights, which allows querying the model directly to decide how to increase the model's loss function. Gradient-based strategies assume that the adversary has access to the gradient function of the model (i.e., white-box access). The core idea is to find the perturbation vector $\delta^*$ that maximizes the loss function $J(\theta, x + \delta, y)$ of the model, where $\theta$ are the parameters (i.e., weights) of the model $f$. In recent years, several white-box adversarial sample crafting methods have been proposed, especially for image classification tasks. Some of the most notable ones are: the Fast Gradient Sign method (FGS) (FGSM), the Basic Iterative Method (BIM) (BIM), the Projected Gradient Descent (PGD) method (PGSM), and the Carlini & Wagner (C&W) method (CW). Black-box evasion techniques (e.g., MIM (MIM), HSJA (HSJA20), SPSA (uesato2018adversarial)) usually start from some initial perturbation $\delta_0$, and subsequently probe $f$ on a series of perturbations $\delta_1, \ldots, \delta_n$, to craft $x'$ such that $f$ misclassifies it to a label different from its original one.
In Section 3, we briefly introduce the seven reference evasion attacks we use as baselines for EG-Booster.
3. Studied Baseline Attacks
In this section, we succinctly highlight the four white-box and three black-box attacks we use as baselines for EG-Booster.
3.1. White-Box Attacks
Fast Gradient Sign method (FGS) (FGSM) is a fast one-step method that crafts an adversarial example. Considering the dot product of a weight vector $w$ and an adversarial example $x' = x + \delta$ (i.e., $w^\top x' = w^\top x + w^\top \delta$), the adversarial perturbation causes the activation to grow by $w^\top \delta$. Goodfellow et al. (FGSM) suggested maximizing this increase subject to the maximum perturbation constraint $\|\delta\|_\infty \le \epsilon$ by assigning $\delta = \epsilon \cdot \mathrm{sign}(w)$. Given a sample $x$, the optimal perturbation is given as follows:

$$\delta^* = \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big) \qquad (3)$$
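The one-step FGS update of Equation 3 can be sketched with a toy white-box model (the logistic-regression model, its weights, and the example values below are illustrative, not from the paper):

```python
import numpy as np

def fgs_perturbation(grad, eps):
    """One-step FGS: move each input feature by eps in the sign
    direction of the loss gradient with respect to the input."""
    return eps * np.sign(grad)

def input_gradient(w, b, x, y):
    """Analytic input-gradient of the cross-entropy loss for a toy
    logistic-regression model p(y=1|x) = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return (p - y) * w
```

With `w = [1, -1]`, `b = 0`, and a correctly classified `x = [2, 1]` (class 1, since `w.x + b > 0`), a single FGS step with `eps = 1` flips the toy model's decision.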
Basic Iterative Method (BIM) (BIM) was introduced as an improvement over FGS. In BIM, the authors suggest applying the same step as FGS multiple times with a small step size $\alpha$ and clipping the pixel values of intermediate results after each step to ensure that they remain in an $\epsilon$-neighbourhood of the original image. Formally, the generated adversarial sample after $N+1$ iterations is given as follows:

$$x'_{N+1} = \mathrm{Clip}_{x,\epsilon}\Big\{\, x'_N + \alpha \cdot \mathrm{sign}\big(\nabla_x J(\theta, x'_N, y)\big) \Big\}, \quad x'_0 = x \qquad (4)$$
Projected Gradient Descent (PGD) (PGSM) is essentially the same as the BIM attack. The only difference is that PGD initializes the example to a random point in the ball of interest (a ball is the volume space bounded by a sphere, also called a solid sphere; here, the allowable perturbations decided by the $\ell_p$ norm) and does random restarts, while BIM initializes to the original point $x$.
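A minimal NumPy sketch of the PGD loop follows; the $\ell_\infty$ ball, step size, and the toy gradient oracle in the usage example are illustrative choices of ours:

```python
import numpy as np

def pgd_attack(grad_fn, x, y, eps, alpha, steps, seed=0):
    """L_inf PGD sketch: start at a random point inside the eps-ball
    around x, take iterated sign-gradient steps of size alpha, and
    project back onto the ball after every step."""
    rng = np.random.default_rng(seed)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random restart
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)      # projection step
    return x_adv
```

The projection step is what distinguishes PGD from an unconstrained gradient ascent: every iterate stays inside the allowed perturbation ball.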
Carlini-Wagner (C&W) (CW) is one of the most powerful attacks, in which the adversarial example generation problem is formulated as the following optimization problem:

$$\min_{\delta}\ D(x, x + \delta) \quad \text{s.t.}\quad f(x + \delta) = t,\ \ x + \delta \in [0, 1]^n \qquad (5)$$

The goal is to find a small change $\delta$ such that, when added to an image $x$, the image is misclassified (to a targeted class $t$) by the model but is still a valid image. $D$ is some distance metric (e.g., $\ell_0$, $\ell_2$, or $\ell_\infty$). Due to the non-linear nature of the classification function $f$, the authors defined a simpler objective function $g$ such that $f(x + \delta) = t$ if and only if $g(x + \delta) \le 0$. Multiple options for the explicit definition of $g$ are discussed in the paper (CW). Considering the $\ell_2$ norm as the distance $D$, the optimization problem is simplified as follows:

$$\min_{\delta}\ \|\delta\|_2^2 + c \cdot g(x + \delta) \qquad (6)$$

where $c > 0$ is a suitably chosen constant.
3.2. Black-Box Attacks
Momentum Iterative Method (MIM) (MIM) is proposed as a technique for addressing a likely limitation of BIM: the greedy move of the adversarial example in the direction of the gradient in each iteration can easily drop it into poor local maxima. It does so by leveraging the momentum method (momentum), which stabilizes gradient updates by accumulating a velocity vector in the gradient direction of the loss function across iterations, and allows the algorithm to escape from poor local maxima. For a decay factor $\mu$, the gradient $g_{t+1}$ for iteration $t+1$ is updated as follows:

$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J(x^*_t, y)}{\big\|\nabla_x J(x^*_t, y)\big\|_1} \qquad (7)$$

where the adversarial example is obtained as $x^*_{t+1} = x^*_t + \alpha \cdot \mathrm{sign}(g_{t+1})$.
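The update in Equation 7 can be sketched as follows. Choosing the per-step size as `eps / steps` is our simplification to keep the total $\ell_\infty$ budget bounded; it is not prescribed by the MIM paper:

```python
import numpy as np

def mim_attack(grad_fn, x, y, eps, mu, steps):
    """MIM sketch: accumulate an L1-normalised gradient velocity g with
    decay factor mu, then step in sign(g); alpha = eps / steps keeps the
    total L_inf perturbation within eps."""
    alpha = eps / steps
    g = np.zeros_like(x, dtype=float)
    x_adv = x.astype(float).copy()
    for _ in range(steps):
        grad = grad_fn(x_adv, y)
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # Equation 7
        x_adv = x_adv + alpha * np.sign(g)
    return x_adv
```

The momentum term keeps the update direction from oscillating between iterations, which is the stabilizing effect described above.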
HopSkipJump Attack (HSJA) (HSJA20) is an attack that relies on the prediction label only to create an adversarial example. The goal is to generate a perturbed sample $x'$ such that $f(x') \neq y$, while keeping $x'$ close to the original input sample $x$. This can be formulated as an optimization problem: $\min_{x'} d(x', x)$ such that $f(x') \neq y$, where $d$ is a distance metric that quantifies similarity. A perturbed input $x'$ is a successful attack if and only if $f(x') \neq y$, and the boundary between successful and unsuccessful perturbed inputs is the decision boundary of $f$. As an indicator of successful perturbation, the authors introduce a Boolean-valued function $\phi(x')$ that equals $1$ if $f(x') \neq y$ and $-1$ otherwise. The key intuition of HSJA is gradient-direction estimation at the decision boundary, motivated by zeroth-order optimization: the direction estimate is obtained as a Monte Carlo average of $\phi$ evaluated at randomly perturbed points around the current iterate.
Simultaneous Perturbation Stochastic Approximation (SPSA) by Uesato et al. (uesato2018adversarial) is an attack that repurposes gradient-free optimization techniques into adversarial example attacks. The authors explore adversarial risk as a measure of the model's performance on worst-case inputs. Since the exact adversarial risk is computationally intractable to evaluate, they instead frame commonly used attacks (such as the ones we described earlier) and adversarial evaluation metrics to define a tractable surrogate objective to the true adversarial risk. The details of the algorithm are in (uesato2018adversarial).
4. Explanation-Guided Booster
In this section, we describe the details of EG-Booster.
4.1. Overview
In the reference attacks we introduced in Section 3, without loss of generality, the problem of crafting an adversarial example can be stated as follows: given an input $x$ such that $f(x) = y$, the goal of the attack is to find the optimal perturbation $\delta^*$ such that the adversarial example $x' = x + \delta^*$ is misclassified as $y' \neq y$. This problem is formulated as:

$$\delta^* = \operatorname*{arg\,min}_{\delta:\, f(x+\delta) = y' \neq y} \|\delta\|_p \quad \text{s.t.}\quad \|\delta\|_p \le \epsilon \qquad (8)$$
The only constraint of Equation 8 is that the perturbation size $\|\delta\|_p$ is bounded by a maximum allowable perturbation size $\epsilon$ (i.e., $\|\delta\|_p \le \epsilon$), which ensures that adversarial manipulations preserve the semantics of the original input. However, it does not guarantee that all single-feature perturbations (i.e., the $\delta_i$) are consequential enough to result in evasion. Our Explanation-Guided Booster (EG-Booster) approach improves Equation 8 to guide any $\ell_p$-norm attack to perform only necessary perturbations that can cause evasion. In addition to the upper bound constraint $\epsilon$, EG-Booster satisfies an additional constraint that guarantees the perturbation of only the features that initially contribute to a correct prediction (positive features). In other words, EG-Booster serves to guide the state-of-the-art attacks towards perturbing only features with positive explanation weights $W_i > 0$, where $W = (W_1, \ldots, W_d)$ are the feature weights (explanations) of the input $x$ towards the true label $y$ (as explained in Section 2.1). Formally, the EG-Booster adversarial example crafting problem is stated as follows:

$$\delta^* = \operatorname*{arg\,min}_{\delta:\, f(x+\delta) = y' \neq y} \|\delta\|_p \quad \text{s.t.}\quad \|\delta\|_p \le \epsilon \ \ \text{and}\ \ \forall i:\ \delta_i \neq 0 \Rightarrow W_i > 0 \qquad (9)$$
The second constraint ensures that only features with positive explanation weights are selected for perturbation.
As shown in Algorithm 1 (line 10), EG-Booster first performs ML explanation on the input $x$ in order to gather the original direction of each feature with respect to the true label $y$. Next, using the explanation results ($W$), EG-Booster reviews the initial perturbations performed by a baseline attack by eliminating perturbations of features that have a negative explanation weight (lines 16–25) and adding more perturbations on features that have a positive explanation weight (lines 26–44). We ensure that EG-Booster's intervention is preconditioned on maintaining at least the same evasion success of an evasion attack, if not improving it. In particular, in case the initial adversarial sample $x'$ generated by a baseline attack succeeds in evading the model, EG-Booster eliminates only perturbations that do not affect the initial evasion result. In this case, even if there exist unperturbed positive features when $x'$ was crafted, EG-Booster does not perform any additional perturbations since the evasion result has already been achieved.
Next, we use Algorithm 1 to further explain the details of EG-Booster.
4.2. Eliminating Non-Consequential Perturbations
As shown in Algorithm 1 (lines 16–25), EG-Booster starts by eliminating unnecessary perturbations that are expected to be non-consequential to the evasion result (if any). If the adversarial input produced by the baseline attack already leads to a successful evasion, EG-Booster ensures that the perturbation elimination step does not affect the initial evasion success (lines 19–20). Accordingly, EG-Booster intervenes to eliminate the perturbations that have no effect on the final evasive prediction. More precisely, it searches for perturbed features that are originally not directed to the true label $y$ (line 17), and then restores the original values of those features (i.e., eliminates the perturbations) (line 18) while ensuring the validity of the initial evasion if it exists (it does so by skipping any elimination of feature perturbations that does not preserve the model evasion). As we will show in our experiments (Section 5.3), this step can lead to a significant drop in the total number of perturbed features, which results in a smaller perturbation size $\|\delta\|$.
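Since Algorithm 1 is not reproduced here, the elimination step can be illustrated with a minimal sketch (the function names and the flat feature representation are our simplifications, not the paper's code):

```python
import numpy as np

def eliminate_nonconsequential(model_predict, x, x_adv, weights, y):
    """Restore every perturbed feature whose explanation weight is
    negative (directed away from y), skipping any restoration that
    would break an already-successful evasion."""
    x_new = x_adv.copy()
    evading = model_predict(x_new) != y
    for i in np.flatnonzero((x_adv != x) & (np.asarray(weights) < 0)):
        candidate = x_new.copy()
        candidate[i] = x[i]                      # undo this perturbation
        if evading and model_predict(candidate) == y:
            continue                             # keep it: removal breaks evasion
        x_new = candidate
    return x_new
```

For a toy threshold classifier whose decision depends only on the first feature, a baseline attack that perturbed both features keeps only the consequential perturbation after this step.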
Figure 1 shows an MNIST-CNN example of the impact of enhancing the FGS (FGSM) attack with EG-Booster when the baseline perturbations are already evasive. In this example, FGS perturbations fool the CNN model into incorrectly predicting digit '9' as digit '8'. The red circles on the left-hand side image show the detection of non-consequential perturbations signaled by EG-Booster using the pre-perturbation model explanation results. The new version of the input image shown on the right-hand side of the figure illustrates the impact of eliminating those non-consequential feature perturbations: the number of feature perturbations is reduced by 7% while preserving the evasion success.
The other alternative is when the baseline attack initially fails to evade the model. In such a case, it is crucial to perform this perturbation elimination step over all non-consequential features before making any additional perturbations. Doing so reduces the perturbation size $\|\delta\|$, which provides a larger gap ($\epsilon - \|\delta\|$) for performing additional consequential perturbations that can cause a misclassification.
4.3. Adding Consequential Perturbations
This second step is only needed in case the adversarial sample has still failed to evade the model after eliminating non-consequential perturbations. In this case, EG-Booster starts searching for unperturbed features that are directed towards the true label $y$. When it finds such features, it incrementally adds small perturbations to them, since they are signaled by the ML explainer as crucial features directed to a correct classification (lines 28–29). The function get_delta(x) carefully chooses a random feature perturbation that satisfies the feature constraint(s) of the dataset at hand (line 28). For instance, in image classification, if the feature representation of the samples is in grayscale, a pixel perturbation should be in the allowable range of feature values (e.g., $[0, 1]$ for normalized MNIST features). In case a feature perturbation breaks the upper bound constraint $\epsilon$, EG-Booster iteratively keeps reducing the perturbation until the bound constraint is satisfied or the number of iterations hits its upper bound (lines 30–37).
In case all features directed to the correct label are already perturbed, EG-Booster proceeds by attempting to increase the perturbations initially performed by the baseline attack on positive features. It does so while respecting the upper bound constraint ($\|\delta\| \le \epsilon$). This process immediately terminates whenever the evasion goal is achieved.
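The addition step and its bound-respecting shrinking loop can be sketched as follows; the fixed initial step `delta` stands in for the paper's get_delta(x), and the toy model in the usage test is illustrative:

```python
import numpy as np

def add_consequential(model_predict, x, x_adv, weights, y, eps,
                      delta=0.5, max_iter=10):
    """Perturb not-yet-perturbed positive features (weight towards y),
    halving each perturbation until the L_inf bound eps is respected,
    and stopping as soon as the model is evaded."""
    x_new = x_adv.copy()
    for i in np.flatnonzero((x_adv == x) & (np.asarray(weights) > 0)):
        if model_predict(x_new) != y:
            break                               # evasion achieved: stop early
        step, it = delta, 0
        x_new[i] = x[i] + step
        while np.max(np.abs(x_new - x)) > eps and it < max_iter:
            step /= 2                           # shrink to respect the bound
            x_new[i] = x[i] + step
            it += 1
    return x_new
```

The early `break` mirrors the immediate-termination behavior described above: no further positive features are touched once the prediction flips.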
Figure 2 shows an MNIST-CNN example of the impact of enhancing the FGS attack (FGSM) with EG-Booster. In this example, FGS perturbations originally failed to fool the CNN model into predicting an incorrect label (i.e., a label other than '7'). The right-hand side image shows the perturbations (circled in green) added by EG-Booster that are consequential to fool the CNN model into predicting the input image as '9' instead of '7'.
4.4. Stability Analysis
As stated earlier (Section 2.1), when evaluated on security-sensitive learning tasks (e.g., malware classifiers), ML explanation methods have been found to exhibit output instability, whereby the explanation weights of the same input sample can differ from one run to another (explainEval; explainEval2020). In order to ensure that EG-Booster does not inherit the instability limitation of the ML explainer, we perform stability analysis where we compare the similarity of the prediction results returned by the target model after performing an explanation-guided attack over multiple runs. More precisely, we define the $k$-stability metric, which quantifies the stability of EG-Booster across $k$ distinct runs. It measures the average run-wise (i.e., over each pair of runs) similarity between the returned predictions of the same adversarial samples after the EG-Booster attack. The similarity between two runs $i$ and $j$, $Sim(R_i, R_j)$, is the intersection size of the predictions returned by the two runs over all samples in the test set (i.e., the number of matching predictions). Formally, we define the $k$-stability as follows:

$$k\text{-Stability} = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} Sim(R_i, R_j) \qquad (10)$$
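Equation 10 amounts to averaging, over all pairs of runs, the fraction of test samples whose post-attack predictions match; a direct sketch:

```python
import numpy as np
from itertools import combinations

def k_stability(runs):
    """k-Stability sketch: `runs` is a (k, n) array where row i holds the
    target model's predictions for the n test samples after run i of the
    explanation-guided attack.  Each pair of runs is scored by its
    fraction of matching predictions, then averaged over all pairs."""
    runs = np.asarray(runs)
    sims = [np.mean(runs[i] == runs[j])
            for i, j in combinations(range(len(runs)), 2)]
    return float(np.mean(sims))
```

For instance, three runs producing predictions `[1,2,3]`, `[1,2,3]`, `[1,2,4]` yield pair similarities 1, 2/3, and 2/3, so the 3-Stability is 7/9.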
The instability of explanation methods is mainly observed in minor differences between the magnitudes of the returned explanation weights of each feature, which might sometimes lead to different feature rankings. However, the direction of a feature is less likely to change from one run to another, as it is uniquely decided by the sign of its explanation weight $W_i$. This is mainly visible on images, as it is possible to plot feature directions with different colors on top of the original image. EG-Booster only relies on the sign of the explanation weights in the process of detecting non-consequential perturbations and candidates for consequential perturbations. As a result, the output of EG-Booster is expected to be more stable than the output of the underlying explanation method.
To validate our intuition, we compare the stability of EG-Booster with the stability of the employed explanation method. To that end, we compute the $(k,l)$-Stability metric proposed by prior work (explainEval2020) that evaluates the stability of ML explanation methods for malware classification models. The $(k,l)$-Stability measure computes the average explanation similarity based on the intersection of features that are ranked in the top-$l$, returned by separate runs of an explanation method. More precisely, given two explanation results returned by two different runs $i$ and $j$, and a parameter $l$, their similarity is obtained based on the Dice coefficient:

$$Sim_{i,j}(x) = \frac{2\,\big|T^l_i(x) \cap T^l_j(x)\big|}{\big|T^l_i(x)\big| + \big|T^l_j(x)\big|} \qquad (11)$$

where $T^l_i(x)$ denotes the top-$l$ features of sample $x$ with respect to the prediction returned by run $i$. Over a total of $k$ runs, the average $(k,l)$-Stability on a sample $x$ is defined as follows:

$$(k,l)\text{-Stability}(x) = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} Sim_{i,j}(x) \qquad (12)$$
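For one sample, the $(k,l)$-Stability of Equations 11–12 can be computed directly from the $k$ weight vectors (ranking features by absolute weight is our assumption about how the top-$l$ is selected):

```python
import numpy as np
from itertools import combinations

def kl_stability(explanations, l):
    """(k,l)-Stability sketch for a single sample: take the top-l features
    of each run's explanation (ranked by absolute weight) and average the
    Dice coefficient of these sets over all pairs of runs."""
    tops = [set(np.argsort(-np.abs(np.asarray(w)))[:l]) for w in explanations]
    dice = [2 * len(a & b) / (len(a) + len(b))
            for a, b in combinations(tops, 2)]
    return float(np.mean(dice))
```

Two runs whose top-2 feature sets share exactly one feature score a Dice coefficient of 0.5, while identical top-2 sets score 1.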
A more detailed empirical discussion of the stability results over all samples in the test set is presented in Section 5.4.
5. Experimental Evaluation
In this section, we report our findings through a systematic evaluation of EG-Booster. We first describe our experimental setup in Section 5.1. In Section 5.2, we present details of EG-Booster's evasion effectiveness. In Sections 5.3, 5.4, and 5.5, we present our findings on perturbation change, stability analysis, and execution time of EG-Booster, respectively.
5.1. Experimental Setup
Datasets. We evaluate EG-Booster on two standard datasets used for adversarial robustness evaluation of deep neural networks: MNIST (MNIST), a handwritten digit recognition task with ten class labels (0–9), and CIFAR10 (cifar), an image classification task with ten classes. We use the 10K test samples of both datasets to evaluate the performance of EG-Booster on undefended neural networks (described next). To evaluate EG-Booster against a defended model, we use a subset of the CIFAR10 test set.
Models. For MNIST, we train a state-of-the-art 7-layer CNN model from the "Model Zoo" (https://gitlab.com/secml/secml//tree/master/src/secml/model_zoo) provided by the SecML library (melis2019secml). It is composed of 3 conv layers + ReLU, 1 flatten layer, 1 fully-connected layer + ReLU, a dropout layer (p=0.5), and finally a flatten layer. We call this model MNIST-CNN; it reaches high test accuracy over the MNIST test samples.
For CIFAR10, we consider two different models: a CNN model with 4 conv2D layers, and a benchmark adversarially-trained ResNet50 model. In our experimental results, we call the first (undefended) model CIFAR10-CNN and the second (defended) model CIFAR10-ResNet. More precisely, the structure of CIFAR10-CNN is: 2 conv2D + ReLU, 1 MaxPool2D, 1 dropout (p=0.25), 2 conv2D + ReLU, 1 MaxPool2D, 1 dropout (p=0.25), 2 fully-connected + ReLU, 1 dropout (p=0.25), and 1 fully-connected layer. To avoid the reduction of the height and width of the images, padding is used in every convolutional layer. After training, this model outperforms the benchmark models adopted by Carlini & Wagner (2017) (CW) and Papernot et al. (2016) (distillation).
CIFAR10-ResNet is a state-of-the-art adversarially-trained network model from the "robustness" library (robustness) proposed by MadryLab (https://github.com/MadryLab/robustness/blob/master/robustness/cifar_models/ResNet.py). It is trained on images generated with the PGD attack under an $\ell_p$-norm perturbation bound, and reaches high test accuracy over the CIFAR10 test set.
Baseline Attacks. We evaluate EG-Booster in a white-box setting using 4 state-of-the-art white-box baseline attacks, i.e., FGS (FGSM), BIM (BIM), PGD (PGSM), and C&W (CW). Furthermore, we validate EG-Booster in a black-box setting using 3 state-of-the-art black-box attacks: MIM (MIM), HSJA (HSJA20), and SPSA (uesato2018adversarial). Detailed explanations of all 7 baseline attacks are provided in Section 3. We provide the hyperparameter choices of each attack in the Appendix (Table 4 in Section 7.1).
ML Explainer. The performance of EG-Booster is influenced by the effectiveness of the employed ML explanation method in detecting the direction of each feature in a sample $x$. It is, therefore, crucial to choose the ML explainer suitably, according to the studied systems and the deployed models. In our experiments, we focus on image datasets and neural network models; we therefore pick SHAP (SHAP), as it has proven effective in explaining deep neural networks, especially in the image classification domain. Furthermore, the SHAP authors proposed an explainer called "Deep Explainer", designed for deep learning models, specifically for image classification. SHAP requires no internal access to the target model, which makes EG-Booster suitable for either a black-box or a white-box threat model. We note that independent recent studies (explainEval; explainEval2020) evaluated ML explanation methods for malware classifiers and revealed that LIME outperforms other approaches in security systems in terms of accuracy and stability. Thus, we recommend using LIME for future deployments of EG-Booster for robustness evaluation of ML malware detectors.
Evaluation metrics. We use the following four metrics to evaluate EG-Booster:
Evasion Rate: First, we quantify the effectiveness of EG-Booster by monitoring changes to the percentage of successful evasions with respect to the total test set, from the baseline attacks to the EG-Booster attacks.
Average Perturbation Change: Second, we track the average change in the number of perturbed features between a baseline attack and its EG-Booster counterpart, to show the impact of the perturbation additions and eliminations performed by EG-Booster. Referring to Algorithm 1 (lines 23 and 42), EG-Booster keeps track of the number of added perturbations $p^{add}_i$ and the number of eliminated perturbations $p^{del}_i$ per image $i$. Thus, the net perturbation change per image is $p^{add}_i - p^{del}_i$, and the average perturbation change across the $N$ images of the test set is:

$$\text{Average Perturbation Change} = \frac{1}{N}\sum_{i=1}^{N}\frac{p^{add}_i - p^{del}_i}{p^{base}_i} \qquad (13)$$

where $p^{base}_i$ is the total number of perturbations initially performed by the baseline attack on example $i$. When the Average Perturbation Change is positive, EG-Booster is, on average, performing more perturbation additions than eliminations; otherwise, it is performing more eliminations of non-consequential perturbations. The average becomes zero when there are equal numbers of added and eliminated perturbations.
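As a concrete sketch of the average perturbation change metric (Equation 13): with `p_add`, `p_del`, and `p_base` denoting, per image, the perturbations added by EG-Booster, those it eliminated, and those initially made by the baseline attack (the names are ours, not the paper's), the computation is a few lines:

```python
import numpy as np

def avg_perturbation_change(p_add, p_del, p_base):
    """Average perturbation change across the test set (Eq. 13 sketch).

    p_add[i]  -- perturbations EG-Booster added to image i
    p_del[i]  -- non-consequential perturbations it eliminated
    p_base[i] -- perturbations initially made by the baseline attack
    """
    p_add, p_del, p_base = map(np.asarray, (p_add, p_del, p_base))
    # Per-image net change, normalized by the baseline perturbation count.
    per_image = (p_add - p_del) / p_base
    return per_image.mean()

# A negative value means EG-Booster mostly eliminated perturbations:
avg = avg_perturbation_change([1, 0], [3, 4], [10, 8])  # close to -0.35
```

A negative average thus signals net elimination of non-consequential perturbations, a positive one net addition, matching the interpretation above.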
k-Stability & (k,l)-Stability: To evaluate the reliability of EG-Booster, we use the k-Stability (Equation 10) and (k,l)-Stability (Equation 12) measures we introduced in Section 4.4. We compute the k-Stability of EG-Booster for different values of k and compare it to the (k,l)-Stability of the employed explanation method (i.e., SHAP), using the top-10 features and different values of k. Both metrics are averaged over samples.
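The exact definitions of these measures live in Section 4.4 (Equations 10 and 12), which this section only references. As an illustration of the general idea, the following is a minimal sketch under our own assumed reading: agreement with a reference run for EG-Booster's evasion outcome, and top-l feature-set overlap across explainer runs.

```python
def k_stability(run_outputs):
    """Fraction of k runs that reproduce the first run's result.
    `run_outputs` holds one hashable outcome per run (e.g., the model's
    predicted label on the EG-Booster adversarial example)."""
    ref = run_outputs[0]
    return sum(o == ref for o in run_outputs) / len(run_outputs)

def kl_stability(runs_topl):
    """Average Jaccard overlap of the top-l feature sets returned by
    k explainer runs, measured against the first run (one plausible
    reading of (k,l)-Stability, not the paper's exact formula)."""
    ref = set(runs_topl[0])
    scores = [len(ref & set(r)) / len(ref | set(r)) for r in runs_topl[1:]]
    return sum(scores) / len(scores)
```

Under this reading, a perfectly stable attack pipeline yields a k-Stability of 1.0 for any k, which is the behavior reported for EG-Booster in Section 5.4.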
Execution Time: Across both datasets and all models, we measure the per-image execution time (in seconds) taken by EG-Booster for the different baseline attacks.
5.2. Evasion Results
Tables 1, 2, and 3 report the results of EG-Booster on the different networks: the undefended MNIST-CNN, the undefended CIFAR10-CNN, and the defended (adversarially trained) CIFAR10-ResNet, respectively.
Evasion rate in a nutshell: Across the three tables, we observe a significant increase in the evasion rate of the studied baseline attacks after combining them with EG-Booster. For instance, Table 1 shows an average increase in the evasion rate across all attacks performed on the undefended MNIST-CNN model. Similarly, the evasion rate results in Table 2 convey that, on average across all studied baseline attacks, more adversarial images are produced by EG-Booster to evade the undefended CIFAR10-CNN model. The same observations can be drawn from Table 3: over all baseline attacks performed on the adversarially trained CIFAR10-ResNet model, we observe an average increase in the evasion rate when they are combined with EG-Booster. In a nutshell, our findings consistently suggest that explanation methods can be employed to effectively guide evasion attacks towards higher evasion accuracy.
EG-Booster is consistently effective across model architectures, threat models, and distance metrics: Our results consistently suggest that EG-Booster is agnostic to the model architecture and threat model, and that it supports diverse distance metrics. For instance, the increase in evasion rate is observed for white-box baseline attacks (i.e., FGS, BIM, PGD, and C&W) as well as for black-box attacks (i.e., MIM, HSJA, and SPSA). Additionally, the improvement in baseline attack performance is observed for different norms. For the same perturbation bound and the same attack strategy (e.g., FGS, PGD), we notice an increase in the evasion rate regardless of the employed distance metric.
EG-Booster is consistently effective over a range of perturbation bounds: Our experiments additionally assess EG-Booster for different perturbation bound values. Figure 3 reports the evasion rate curves of the state-of-the-art attacks (e.g., PGD, MIM) before and after being guided by EG-Booster (see EG-PGD, EG-MIM, etc. in Figure 3), using different perturbation bounds. For all target models trained on MNIST and CIFAR-10, we observe a significant improvement in the evasion rate of all baseline attacks regardless of the bound. For most attacks, however, a higher perturbation bound, which allows a larger perturbation size, can lead to a higher evasion rate.
It is noteworthy that the increase in evasion rate is considerably higher for baseline attacks that initially have a low evasion rate. For instance, FGS and BIM, which initially produced comparatively few evasive adversarial samples out of the 10K MNIST samples, improved at least twofold after employing EG-Booster. Similar observations can be drawn for CIFAR10-CNN and CIFAR10-ResNet; in particular, for CIFAR10-CNN, the rate of increase in evasion results is higher for one FGS norm variant than for the other. However, even though EG-Booster yields a higher rate of increase in evasive samples for less performant attacks, we note that the total evasion rate of EG-Booster combined with stronger attacks remains higher. This is expected, as EG-Booster builds on the initial results of baseline attacks, which results in a correlation between the initial evasion rate of a baseline attack and the post-EG-Booster evasion rate. This is mainly observed for the C&W attack: across all models, EG-Booster combined with C&W exhibits the highest evasion rates. Such a correlation is also evident in the evasion rate curves of Figure 3.
EG-Booster is still effective even against defended models: In Table 3, we focus our experiments on a defended model in order to examine the effectiveness of EG-Booster against the adversarial training defence. Since the defence is trained with gradient-based adversarial examples, CIFAR10-ResNet is expected to be especially robust against gradient-based attacks performed using the same norm. Indeed, such attacks achieve considerably lower initial evasion rates on CIFAR10-ResNet compared to the undefended models. Nevertheless, combined with EG-Booster, they result in more evasive samples, even against a defended model. The same findings can be observed for the other baseline attacks. We conclude that EG-Booster is still effective even against defended models. Given these promising results, we hope that EG-Booster will be considered in the future as a benchmark for robustness assessment of ML models.
5.3. Perturbation Change
In addition to the evasion rate metric, we keep track of the perturbation changes introduced by EG-Booster in each experiment. More precisely, we compute the average perturbation change, which accounts for both the number of added and the number of eliminated perturbations (Equation 13). The results in the last rows of Tables 1, 2, and 3 show that, for most baseline attacks, EG-Booster makes fewer perturbations while improving the evasion rate; this is reflected by the negative sign of the average perturbation change for most of the studied baseline attacks across the three tables. These findings indicate that, without considering the pre-perturbation feature-direction explanations, baseline attacks initially perform a considerable number of non-consequential (unnecessary) perturbations. Consequently, in addition to the improvement of the evasion rate, these results demonstrate the importance of taking the feature explanation weights into account in the formulation of evasion attacks (Formulation 9).
It is noteworthy that the magnitude of the negative average perturbation change values is notably larger for baseline attacks that initially have a high evasion rate. This is mainly true for the C&W attack across the three tables. We explain this observation by the fact that EG-Booster performs additional perturbations only when the baseline adversarial sample originally fails to evade the model (Algorithm 1, line 27); otherwise, it only eliminates non-consequential perturbations while maintaining the original evasion result.
In some experiments, we notice a positive average perturbation change; this is particularly the case for two attacks in Table 1. Here, EG-Booster performs more perturbation additions than eliminations, which is reflected in the drastic improvement of the evasion rate for these two attacks. These findings demonstrate the direct impact of perturbing features that are directed to the true label in preventing the model from making a correct prediction.
Focusing on Tables 1 and 2, we notice that, on average, the rate of increase in evasion results for MNIST is higher than for CIFAR-10, using a CNN model for both datasets. Additionally, the average perturbation change for all experiments in Table 2 (i.e., CIFAR10-CNN) is negative and has a higher magnitude than its counterparts in Table 1 (MNIST-CNN). Further investigation has shown that these two observations are related. In particular, we found that baseline attacks performed on CIFAR10-CNN already perturb a considerable number of features that are directed to the true label, which makes the rate of perturbations added by EG-Booster on CIFAR10-CNN lower than on MNIST-CNN. This observation might explain the difference in evasion increase between the two datasets. However, as discussed in Section 4, EG-Booster specifically examines this case, i.e., the case where a substantial number of detected consequential features turn out to be already perturbed. More precisely, in case of initial evasion failure, EG-Booster proceeds by additionally perturbing features that are already perturbed by the baseline attack, while ensuring that the perturbation size remains within the bound.
5.4. Stability Analysis
As highlighted by prior work (explainEval; explainEval2020) and thoroughly discussed in this paper, owing to stability concerns, ML explanation methods should be carefully introduced when adopted in security-critical settings. Thus, we evaluate the reliability of explanation methods for the robustness assessment of ML systems by conducting a comparative analysis between the stability of the employed ML explainer (i.e., SHAP) and the stability of our explanation-based method. In particular, we use the (k,l)-Stability and k-Stability metrics defined in Section 4.4 to compute, respectively, the output stability of SHAP and that of EG-Booster combined with the baseline attacks, across the different studied models.
EG-Booster does not inherit the instabilities of SHAP: In Figure 4, we plot our stability analysis results. Focusing on the stability curves of EG-Booster, we deduce that it is almost perfectly stable for all studied models and across the different baseline attacks. Consequently, the classification results returned by the target ML model are exactly the same across different runs of EG-Booster. These findings support the reliability of our approach for confident adoption in improved robustness assessment of ML models against baseline attacks. Compared to the stability curves of SHAP, we observe that EG-Booster does not inherit the instability exhibited by the ML explainer's output. As discussed in Section 4.4, these findings are explained by EG-Booster's reliance only on the sign of the explanation weights to decide the feature directions (i.e., positive and negative features). Since the run-to-run distortions in the feature directions returned by an accurate ML explainer are minor, the feature selection for perturbation performed by EG-Booster returns the same feature sets across different runs, which overall leads to the same evasion results.
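Since only the sign of each explanation weight decides a feature's direction, the selection step is robust to small magnitude fluctuations across explainer runs. A minimal illustration (the function name and threshold convention are ours, not the paper's implementation):

```python
import numpy as np

def feature_directions(weights):
    """Split features by the sign of their explanation weights.
    Positive weights point toward the true label (candidates for
    consequential perturbation); negative weights point away from it.
    Zero-weight features are left out of both sets."""
    weights = np.asarray(weights, dtype=float)
    positive = np.flatnonzero(weights > 0)
    negative = np.flatnonzero(weights < 0)
    return positive, negative

# Small magnitude noise does not change the selected sets,
# as long as no weight crosses zero:
w = [0.3, -0.1, 0.0, 0.7, -0.4]
pos, neg = feature_directions(w)  # indices [0, 3] and [1, 4]
```

Because the output depends on signs only, two explainer runs whose weights differ slightly in magnitude still yield identical feature sets, which is the mechanism behind the stability observed in Figure 4.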
Although EG-Booster is not deeply influenced by the stability of the employed ML explainer, its success nevertheless relies on the explainer's accuracy in producing precise explanations, specifically precise feature directions in our case. Given the promising results that our approach showed when relying on SHAP, we hope that future improvements in ML explanation methods will lead to even better performance of EG-Booster.
5.5. Execution Time
The measurements we present here are based on a hardware environment with 6 vCPUs, 18.5 GB of memory, and one NVIDIA Tesla K80 GPU.
In Figure 5, we plot the range of execution times of EG-Booster recorded across the different experiments, with the average per-sample times for MNIST-CNN, CIFAR10-CNN, and CIFAR10-ResNet indicated by the orange line. Overall, these observations reflect the efficiency of EG-Booster; however, they additionally reveal the impact of the input dimension and the model architecture on its running time. For instance, EG-Booster takes more time on CIFAR-10 compared to MNIST, which is due to the difference in input dimension between the two datasets, i.e., (28,28) for MNIST and (3,32,32) for CIFAR-10. Additionally, for the same dataset (i.e., CIFAR-10), we observe a considerable difference in the execution time of EG-Booster between the CNN and ResNet50, which is explained by the higher dimensionality of ResNet50 compared to the CNN. This final result is expected, since EG-Booster is a query-based approach (Algorithm 1, lines 10, 13, 19, and 27); thus, the time a neural network takes to respond to a query influences the total execution time of EG-Booster.
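A per-sample wall-clock measurement of this kind can be sketched as follows (`attack_fn` and `samples` are placeholders, not the paper's API):

```python
import time

def time_per_sample(attack_fn, samples):
    """Average per-sample wall-clock time (seconds) of an attack.
    For a query-based method, this figure absorbs the model's
    per-query response time, as noted in Section 5.5."""
    start = time.perf_counter()
    for x in samples:
        attack_fn(x)
    return (time.perf_counter() - start) / len(samples)
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts on long runs.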
6. Related Work
In recent years, we have witnessed a flurry of adversarial example crafting methods (Carlinilist), and keeping up to date with new research in this sub-domain is a challenge. Nevertheless, in the following we position EG-Booster with respect to closely related work.
Evasion Attacks. Here, we focus on the benchmark attacks that EG-Booster builds on. While ML evasion attacks can be traced back to the early 2000s (Wildpatterns18), recent improvements in the accuracy of deep neural networks on non-trivial learning tasks have turned adversarial manipulations into potential threats to ML models. Biggio et al. (BiggioECML13) introduced gradient-based evasion attacks against PDF malware detectors. Szegedy et al. (szegedy2014intriguing) demonstrated gradient-based attacks against image classifiers. Goodfellow et al. (FGSM) introduced FGSM, a fast one-step adversarial example generation strategy. Kurakin et al. (BIM) later introduced BIM as an iterative improvement of FGSM. Around the same time, Madry et al. (PGSM) proposed PGD. The common thread among this class of attacks and similar gradient-based ones is that they all aim to solve for an adversarial example that is misclassified with maximum confidence subject to a bounded perturbation size.
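As background for these bounded-perturbation attacks, here is a minimal NumPy sketch of the FGSM step and its iterative BIM variant, assuming inputs scaled to [0, 1] and a caller-supplied gradient-of-the-loss function (a sketch of the published algorithms, not the paper's code):

```python
import numpy as np

def fgsm(x, grad, eps, lo=0.0, hi=1.0):
    """One-step FGSM: move each feature by eps in the sign direction of
    the loss gradient, then clip back to the valid input range."""
    return np.clip(x + eps * np.sign(grad), lo, hi)

def bim(x, grad_fn, eps, alpha, steps):
    """BIM: iterate small FGSM-style steps of size alpha, projecting the
    total perturbation into the L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv
```

PGD follows the same loop as BIM but typically starts from a random point inside the eps-ball, which helps it escape poor local optima.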
Another line of evasion attacks aims to minimize the perturbation distance subject to the evasion goal (e.g., classifying a sample as a target label chosen by the adversary). Among such attacks, Carlini and Wagner (CW) introduced the C&W attack, one of the strongest white-box attacks to date. Other attacks such as DeepFool (DeepFool16) and SparseFool (SparseFool19) use the same notion of minimum perturbation distance. Recent attacks introduced by Brendel et al. (AccurateFast19) and Croce and Hein (Croce020a) improve on the likes of DeepFool and SparseFool, in both the accuracy and the speed of adversarial example generation. EG-Booster builds on this body of adversarial example crafting attacks; as our evaluations suggest, it improves the evasion rate of major benchmark attacks used by prior work.
Pintor et al. (FMN21) propose the fast minimum-norm (FMN) attack, which improves both the accuracy and the speed of benchmark gradient-based attacks across norms, for both targeted and untargeted white-box attacks. Like FMN, EG-Booster covers the MNIST and CIFAR-10 datasets, is evaluated on undefended and defended models, considers multiple norm distance metrics, and on aggregate improves the evasion accuracy of existing attacks. Unlike FMN, EG-Booster covers black-box as well as white-box attacks, is not limited to gradient-based attacks, leverages explanations to improve the accuracy of existing attacks instead of proposing a new attack strategy, and is limited to untargeted attacks.
Among black-box evasion attacks, we highlight MIM (MIM), HSJA (HSJA20), and SPSA (uesato2018adversarial). MIM leverages the momentum method (momentum), which stabilizes gradient updates by accumulating a velocity vector in the gradient direction of the loss function across iterations and allows the algorithm to escape poor local maxima. HSJA relies on the prediction label only to create adversarial examples, with a key intuition based on gradient-direction estimation motivated by zeroth-order optimization. In SPSA, Uesato et al. (uesato2018adversarial) explore adversarial risk as a measure of the model's performance on worst-case inputs, defining a tractable surrogate objective to the true adversarial risk.
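The momentum update at the heart of MIM can be sketched as follows (one iteration, assuming [0, 1] inputs and an L-infinity bound; the function and argument names are ours):

```python
import numpy as np

def mim_step(x_adv, g_prev, grad, x0, mu, alpha, eps):
    """One MIM iteration: accumulate the L1-normalized gradient into a
    velocity vector g, then take a sign step and project the iterate
    into the eps-ball around the original input x0."""
    g = mu * g_prev + grad / (np.abs(grad).sum() + 1e-12)
    x_new = x_adv + alpha * np.sign(g)
    x_new = np.clip(x_new, x0 - eps, x0 + eps)  # respect the L-inf bound
    return np.clip(x_new, 0.0, 1.0), g
```

With decay factor `mu = 0`, the update degenerates to a BIM step; the accumulated velocity is what smooths the update direction across iterations.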
Explanation-Guided Attacks. A recent body of work demonstrates how ML explanation methods can be harnessed to poison training data, extract models, and infer training data points. Severi et al. (EGPoisoning) leverage ML explanation methods to guide the selection of features used to construct backdoor triggers that poison the training data of malware classifiers. Milli et al. (EGmodelextraction19) demonstrate that gradient-based model explanations can be leveraged to approximate the underlying model. In a similar line of work, Aïvodji et al. (EGmodelextraction20) show how counterfactual explanations can be used to extract a model. Very recently, Shokri et al. (EGmeminference2021) study the tension between transparency via model explanations and privacy leakage via membership inference, and show that backpropagation-based explanations can leak a significant amount of information about individual training data points.
Compared to this body of work, EG-Booster is the first approach to leverage ML explanations to significantly improve the accuracy of evasion attacks.

Reliability Analysis of Explanation Methods. Focusing on the robustness of explanations to adversarial manipulations, Ghorbani et al. (expRob1) and Heo et al. (expRob2) demonstrate that model explanation results are sensitive to small input perturbations that preserve the predicted label. If ML explanations were provided to EG-Booster as an "Explanation-as-a-Service", an adversary could potentially alter the explanation results, which might in turn influence our explanation-guided attack. In EG-Booster, baseline attacks and explanation methods are locally integrated into the robustness evaluation workflow (hence not disclosed to an adversary).
Recent studies by Warnecke et al. (explainEval) and Fan et al. (explainEval2020) systematically analyze the utility of ML explanation methods, especially on models trained for security-critical tasks (e.g., malware classifiers). In addition to general evaluation criteria (e.g., explanation accuracy and sparsity), Warnecke et al. (explainEval) focus on other security-relevant evaluation metrics (e.g., stability, efficiency, and robustness). In a related line of work, Fan et al. (explainEval2020) performed a similar study on Android malware classifiers that led to similar conclusions.
7. Conclusion
In this paper, we introduce EG-Booster, the first explanation-guided booster of ML evasion attacks. Guided by feature-based explanations of model predictions, EG-Booster significantly improves the evasion accuracy of state-of-the-art adversarial example crafting attacks by introducing consequential perturbations and eliminating non-consequential ones. By systematically evaluating EG-Booster on MNIST and CIFAR-10, across white-box and black-box attacks against undefended and defended models, and using diverse distance metrics, we show that EG-Booster significantly improves the evasion accuracy of reference evasion attacks on the undefended MNIST-CNN and CIFAR10-CNN models and on an adversarially trained CIFAR10-ResNet50 model. Furthermore, we empirically show the reliability of explanation methods for the robustness assessment of ML models. We hope that EG-Booster will be used by future work as a benchmark for robustness evaluation of ML models against adversarial examples.
References
Appendix
7.1. Baseline Attack Hyper-Parameters
Besides the distance metrics and perturbation bound values that we specified in Section 5, the hyper-parameters used for each baseline attack are specified in Table 4:
Attack  Hyper-parameters
FGS  , , which indicate the lower and upper bounds of features. These two parameters are specified only for the MNIST dataset.
BIM  .
PGD  , , with solver parameters (, , ).
MIM  , , , , .
SPSA  , , .
HSJA  , , , , , .