GenAttack: Practical Black-box Attacks with Gradient-Free Optimization

05/28/2018 · Moustafa Alzantot et al.

Deep neural networks (DNNs) are vulnerable to adversarial examples, even in the black-box case, where the attacker is limited to query access alone. Existing black-box approaches to generating adversarial examples typically require a significant number of queries, either for training a substitute network or for estimating gradients from the output scores. We introduce GenAttack, a gradient-free optimization technique that uses genetic algorithms to synthesize adversarial examples in the black-box setting. Our experiments on the MNIST, CIFAR-10, and ImageNet datasets show that GenAttack can successfully generate visually imperceptible adversarial examples against state-of-the-art image recognition models with orders of magnitude fewer queries than existing approaches. For example, in our CIFAR-10 experiments, GenAttack required roughly 2,568 times fewer queries than the current state-of-the-art black-box attack. Furthermore, we show that GenAttack can successfully attack both ensemble adversarial training (the state-of-the-art ImageNet defense) and non-differentiable, randomized input transformation defenses. GenAttack's success against ensemble adversarial training demonstrates that its query efficiency enables it to exploit the defense's weakness to direct black-box attacks, while its success against non-differentiable input transformations indicates that its gradient-free nature makes it applicable against defenses that rely on gradient masking or obfuscation to confuse the attacker. Our results suggest that population-based optimization opens up a promising area of research into effective gradient-free black-box attacks.


Introduction

Deep neural networks (DNNs) have achieved state-of-the-art performance in various machine learning and artificial intelligence tasks, such as image classification, speech recognition, machine translation, and game playing. Despite their effectiveness, recent studies have illustrated the vulnerability of DNNs to adversarial examples [Szegedy et al.2013, Goodfellow, Shlens, and Szegedy2014]. For instance, a virtually imperceptible perturbation to an image can lead a well-trained DNN to misclassify it, and targeted adversarial examples can even cause misclassification as a chosen class. Moreover, researchers have shown that these adversarial examples remain effective in the physical world [Kurakin, Goodfellow, and Bengio2016, Athalye et al.2017] and can be crafted in other data modalities, such as natural language [Alzantot et al.2018] and speech [Alzantot, Balaji, and Srivastava2017]. The lack of robustness exhibited by DNNs to adversarial examples has raised serious concerns for security-critical applications.

Nearly all previous work on adversarial attacks [Goodfellow, Shlens, and Szegedy2014, Moosavi-Dezfooli, Fawzi, and Frossard2016, Gu and Rigazio2014, Kurakin, Goodfellow, and Bengio2016] has used gradient-based optimization to find successful adversarial examples. However, gradient computation can only be performed when the attacker has full knowledge of the model architecture and weights. Thus, these methods are only applicable in the white-box setting, where an attacker is given full access to and control over the targeted DNN. When attacking real-world systems, one instead needs to consider the black-box setting, where nothing is revealed about the network architecture, parameters, or training data, and the attacker only has access to the input-output pairs of the classifier. Dominant approaches in this setting have relied on attacking trained substitute networks and hoping the generated examples transfer to the target model [Papernot et al.2017]. This approach suffers from imperfect transferability and the computational cost of training a substitute network. Recent work has used coordinate-wise finite difference methods to estimate gradients directly from the confidence scores; however, these attacks remain computationally expensive and rely on optimization tricks to stay tractable [Chen et al.2017b]. Both approaches are very query-intensive, limiting their practicality in real-world scenarios.

Motivated by the above, we present GenAttack, a novel approach to generating adversarial examples without having to compute or even approximate the gradients, enabling the solution to scale to the black-box case. In order to perform gradient-free optimization, we adopt a population-based approach using genetic algorithms, iteratively evolving a population of feasible solutions until success. By simultaneously pursuing multiple hypotheses, GenAttack is more resilient to poor local minima and can efficiently explore the solution space [Such et al.2018]. In addition, by being gradient-free, GenAttack is naturally robust to defenses which perform gradient masking or obfuscation [Athalye, Carlini, and Wagner2018]. Thus, unlike current approaches, GenAttack can efficiently craft perturbations in the black-box setting which can effectively fool not only state-of-the-art classifiers but recently proposed defenses which manipulate the gradients.

We evaluate GenAttack using state-of-the-art image classification models and find that the algorithm successfully performs targeted black-box attacks with significantly fewer queries than current approaches. For example, in our CIFAR-10 experiments, GenAttack required roughly 2,568 times fewer queries than the current state-of-the-art black-box attack. Furthermore, unlike current solutions, we find that GenAttack can successfully execute targeted attacks on the large-scale ImageNet dataset [Deng et al.2009], demonstrating its practicality. We also demonstrate the success of GenAttack against ensemble adversarial training [Tramèr et al.2017], the state-of-the-art ImageNet defense, and against randomized, non-differentiable input transformation defenses [Guo, Rana, and van der Maaten2017]. These results illustrate the power of GenAttack's query efficiency and gradient-free nature.

In summary, we make the following contributions:

  • We introduce GenAttack, a novel gradient-free approach for generating adversarial examples by leveraging population-based optimization. Upon acceptance, our implementation will be released as open-source.

  • We show that in the restricted black-box setting, GenAttack can generate adversarial examples which force state-of-the-art image classification models to misclassify examples as chosen target labels with significantly fewer queries than current approaches.

  • We show that, unlike current approaches, GenAttack can generate successful targeted adversarial examples on the large-scale ImageNet dataset, demonstrating its ability to perform in realistic scenarios.

  • We further highlight the effectiveness of GenAttack by illustrating its success against state-of-the-art ImageNet defenses, namely ensemble adversarial training and non-differentiable, randomized input transformations. To the best of our knowledge, we are the first to present a successful black-box attack against these defenses.

Related Work

In what follows, we summarize recent approaches for generating adversarial examples, in both the white-box and black-box cases, as well as defending against adversarial examples. Please refer to the cited works for further detail.

White-box attacks & Transferability

In the white-box case, attackers have complete knowledge of and full access to the targeted DNN. In this scenario, the adversary is able to use backpropagation for gradient computation, which greatly increases the strength of gradient-based attacks.

White-box attacks can also be used in black-box cases by taking advantage of transferability. Transferability refers to the property that adversarial examples generated using one model are often misclassified by another model. The substitute model approach to black-box attacks takes advantage of this property to generate successful adversarial examples.

L-BFGS

[Szegedy et al.2013] argues that adversarial examples exist because of blind spots in neural networks. The authors use box-constrained L-BFGS to solve the following optimization problem:

$$\min_{\delta} \; \|\delta\|_2 \quad \text{s.t.} \quad f(x + \delta) = t, \;\; x + \delta \in [0, 1]^n$$

where f is the classifier mapping function that maps an input image x to a discrete label, t is the target output label, and δ is the added noise.

FGSM & I-FGSM

In [Goodfellow, Shlens, and Szegedy2014], the authors proposed the Fast Gradient Sign Method (FGSM), a quick and reliable approach for generating adversarial examples. Let x and x_adv denote the original and adversarial examples, respectively, and let t denote the target class to attack. FGSM uses the gradient of the training loss J with respect to the input for crafting adversarial examples. A targeted attack is crafted by

$$x_{adv} = x - \epsilon \cdot \text{sign}(\nabla_x J(x, t)) \qquad (1)$$

where ε specifies the ℓ∞ distortion between x and x_adv, and sign(·) takes the sign of the gradient. Untargeted attacks can be implemented in a similar fashion. In [Kurakin, Goodfellow, and Bengio2016], an iterative version of FGSM was proposed (I-FGSM), where FGSM is applied iteratively with a finer distortion, followed by an ε-ball clipping. In [Madry et al.2017], PGD is introduced, which modifies I-FGSM to incorporate random starts.
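For concreteness, a minimal NumPy sketch of one targeted FGSM step is shown below. It assumes a hypothetical grad_loss function returning the gradient of the training loss with respect to the input, which is only available in the white-box setting; the function name and interface are illustrative, not part of the cited work.

```python
import numpy as np

def fgsm_targeted(x, target, grad_loss, eps=0.05):
    """One targeted FGSM step (white-box): step against the loss gradient
    toward the target class, then clip back to the valid pixel range.

    x         : np.ndarray, input image scaled to [0, 1]
    target    : int, attacker-chosen label
    grad_loss : callable(x, label) -> dJ/dx, assumed to come from
                white-box access to the model
    eps       : maximum L-infinity distortion
    """
    g = grad_loss(x, target)            # gradient of the loss w.r.t. the input
    x_adv = x - eps * np.sign(g)        # move toward the target class
    return np.clip(x_adv, 0.0, 1.0)     # keep pixels in the valid range
```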

C&W & EAD

Instead of leveraging the training loss, Carlini and Wagner designed an ℓ2-regularized loss function based on the logit layer representation in DNNs for crafting adversarial examples [Carlini and Wagner2017]. Its formulation is as follows:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \max\Big\{ \max_{i \neq t} Z(x + \delta)_i - Z(x + \delta)_t, \; -\kappa \Big\} \qquad (2)$$

where Z(·) denotes the logit layer representation and the second term is the logit layer loss function. By increasing the confidence parameter κ, one increases the necessary margin between the predicted probability of the target class and that of the rest, generating stronger adversarial examples with increased distortion. The untargeted attack formulation is similar. EAD generalizes the C&W attack by incorporating ℓ1 minimization via elastic-net regularization [Chen et al.2017a], and has been shown to generate more robust, transferable adversarial examples [Sharma and Chen2017, Sharma and Chen2018, Lu et al.2018].
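As a reference, here is a small NumPy sketch of the targeted C&W logit loss described above. The logits function standing in for the model's logit layer is an assumed placeholder, and the surrounding optimization loop (Adam over a change of variables) is omitted.

```python
import numpy as np

def cw_loss(x, delta, target, logits, c=1.0, kappa=0.0):
    """Targeted C&W objective: squared L2 distortion plus the hinge-style
    logit loss. `logits(x)` is assumed to return the pre-softmax scores."""
    z = logits(x + delta)                          # logit layer representation
    z_target = z[target]
    z_other = np.max(np.delete(z, target))         # best competing logit
    logit_loss = max(z_other - z_target, -kappa)   # margin term with confidence kappa
    return np.sum(delta ** 2) + c * logit_loss
```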

Black-box attacks

In the literature, the black-box attack setting refers to the case where an attacker has free query access to the inputs and outputs of a targeted DNN but is unable to perform backpropagation on the network. Proposed approaches have relied on transferability and gradient estimation, and are summarized below.

Substitute Networks

Early approaches to black-box attacks made use of free query access to train a substitute model, a representative stand-in for the targeted DNN [Papernot et al.2017]. The substitute DNN can then be attacked using any white-box technique, and the generated adversarial examples are used to attack the target DNN. As the substitute model is trained to mimic the classification rules of the targeted DNN, adversarial attacks against the substitute are expected to transfer to the corresponding target. This approach, however, relies on the transferability property rather than directly attacking the target DNN, which is imperfect and thus limits the strength of the adversary. Furthermore, training a substitute model is computationally expensive and hardly feasible when attacking large models, such as Inception-v3 [Szegedy et al.2015].

ZOO

Due to the unfavorable properties of existing approaches, ZOO was proposed [Chen et al.2017b]. ZOO builds on the C&W attack, due to its state-of-the-art performance, and modifies the loss function such that it only depends on the output of the DNN, as opposed to depending on the logit layer representation. Furthermore, an approximate gradient is computed using the finite difference method on the targeted DNN, and the optimization problem is solved via zeroth order optimization.

For each coordinate, 2 function evaluations are required to estimate the gradient, and performing this computation for all coordinates quickly becomes too expensive in practice. To resolve this issue, stochastic coordinate descent is used, which only requires 2 function evaluations per step. Still, when attacking large black-box networks, such as Inception-v3, computation is quite slow, so a dimension-reduction transformation is applied to the perturbation. With these optimizations, unlike the substitute model approach, attacking Inception-v3 becomes computationally tractable. However, as we demonstrate in our experimental results, the attack is still quite query-inefficient and thus limited in power and impractical for attacking real-world systems.
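The symmetric-difference estimate that ZOO relies on can be sketched as follows. This is a simplified illustration of coordinate-wise gradient estimation, not the authors' implementation; query_loss is an assumed black-box loss oracle built from the model's output scores.

```python
import numpy as np

def estimate_coordinate_gradient(x, i, query_loss, h=1e-4):
    """Estimate the partial derivative of the black-box loss at coordinate i
    using two queries (symmetric finite difference)."""
    e = np.zeros_like(x)
    e.flat[i] = h
    return (query_loss(x + e) - query_loss(x - e)) / (2.0 * h)

# Estimating the full gradient this way costs 2 queries per coordinate,
# which is why ZOO resorts to stochastic coordinate descent: each step
# picks a random (batch of) coordinate(s) and estimates only those.
```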

Recently we became aware of concurrent work that also aims to improve the efficiency and strength of black-box adversarial attacks, namely [Brendel, Rauber, and Bethge2018] and [Ilyas et al.2018]. However, our work remains distinct in its goal and approach. Unlike us, [Brendel, Rauber, and Bethge2018] focus on attacking black-box models with only partial access to the query results, and do not address the query-efficiency problem; notably, their method takes, on average, about 12x more queries than ours to succeed against an undefended ImageNet model. [Ilyas et al.2018] target a similar contribution, but their approach still relies on gradient estimation and was not shown to succeed against state-of-the-art defenses. We treat these contributions as parallel work.

Defending against adversarial attacks

Adversarial Training

Adversarial training is typically implemented by augmenting the original training dataset with label-corrected adversarial examples and retraining the network. In [Madry et al.2017], a high-capacity network is trained against ℓ∞-constrained PGD (I-FGSM with random starts), which is deemed the strongest attack utilizing the local first-order information about the network. It has been shown that this defense is less robust to attacks optimized under other distortion metrics, namely ℓ1 [Sharma and Chen2017]. In [Tramèr et al.2017], the training data is augmented with perturbations transferred from other models, which was demonstrated to provide strong robustness to transferred adversarial examples. We demonstrate in our experimental results that it is less robust to query-efficient black-box attacks, such as GenAttack.
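Conceptually, adversarial training augments each training batch with attacked versions of its inputs. The sketch below illustrates only the idea; attack and train_step are hypothetical placeholders, not functions from the cited works.

```python
import numpy as np

def adversarial_training_epoch(batches, attack, train_step, mix_ratio=0.5):
    """One epoch of adversarial training (illustrative sketch).

    batches    : iterable of (x, y) NumPy arrays
    attack     : callable(x, y) -> x_adv, e.g. a PGD attack on the current
                 model (or on other pre-trained models, as in ensemble
                 adversarial training)
    train_step : callable(x, y) that updates the model on one batch
    mix_ratio  : fraction of each batch replaced by adversarial examples
    """
    for x, y in batches:
        n_adv = int(mix_ratio * len(x))
        x_adv = attack(x[:n_adv], y[:n_adv])       # label-corrected adversarial inputs
        x_mix = np.concatenate([x_adv, x[n_adv:]])
        train_step(x_mix, y)                       # retrain on clean + adversarial mix
```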

Gradient Obfuscation

It has been identified that many recently proposed defenses provide apparent robustness to strong adversarial attacks by manipulating the gradients so that they are nonexistent, incorrect, dependent on test-time randomness, or simply unusable. Specifically, an analysis of the ICLR 2018 non-certified defenses that claim white-box robustness found that 7 of 9 relied on this phenomenon [Athalye, Carlini, and Wagner2018]. It has also been shown that adversarial training can learn to succeed by making the gradients point in the wrong direction [Tramèr et al.2017].

One defense which relies upon gradient obfuscation utilizes input transformations. In [Guo, Rana, and van der Maaten2017], transformations based on image cropping and rescaling, bit-depth reduction, JPEG compression, total variance minimization, and image quilting were explored. These transformations are non-differentiable, forcing gradient-based attackers to rely upon transferability. In addition, transformations like total variance minimization are randomized. In the white-box case, this defense can be successfully attacked by forward-propagating through the transformation as usual but, on the backward pass, replacing the non-differentiable transformation with the identity function [Athalye, Carlini, and Wagner2018]. Though this approach is effective, it is only applicable when the attacker both has knowledge of the non-differentiable component and can backpropagate through the network. We demonstrate in our experimental results that GenAttack, being gradient-free and thus impervious to such gradient manipulation, can naturally handle these defenses in the black-box case.

Threat Model

We consider the following attack scenario. The attacker does not have knowledge of the network architecture, parameters, or training data, and is solely capable of querying the target model as a black-box function:

$$f : \mathbb{R}^d \rightarrow [0, 1]^K$$

where d is the number of input features and K is the number of classes. The output of f is the vector of model prediction scores. Note that the attacker does not have access to intermediate values computed in the network's hidden layers, including the logits.

The goal of the attacker is to perform a targeted attack. Formally speaking, given a benign input example x that is correctly classified by the model, the attacker seeks a perturbed example x_adv for which the network produces the desired target prediction t, chosen by the attacker from the set of labels {1, ..., K}. Additionally, the attacker seeks to minimize the distance D(x, x_adv) in order to maintain the perceptual similarity between x and x_adv. That is,

$$\min_{x_{adv}} \; D(x, x_{adv}) \quad \text{such that} \quad \arg\max_j f(x_{adv})_j = t$$

where the distance norm D is often chosen as the ℓ2 or ℓ∞ norm.

This threat model is equivalent to that of prior work on black-box attacks [Chen et al.2017b, Papernot et al.2017], and is similar to the chosen-plaintext attack (CPA) in cryptography, where an attacker provides the victim with any chosen plaintext message and observes the resulting ciphertext.
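Under this threat model, the attacker's entire view of the system can be expressed as a single scoring function. The sketch below shows the interface GenAttack is assumed to work against; remote_predict is a hypothetical stand-in for whatever API actually serves the model.

```python
import numpy as np

def make_black_box(remote_predict):
    """Wrap a remote prediction API as the black-box function f: R^d -> [0,1]^K.

    remote_predict : callable(np.ndarray) -> np.ndarray of K class scores.
                     The attacker sees only these scores: no gradients,
                     no logits, no knowledge of architecture or weights.
    """
    query_count = {"n": 0}

    def f(x):
        query_count["n"] += 1               # every call costs one query
        scores = remote_predict(x)
        return scores / np.sum(scores)      # normalize to a probability vector

    return f, query_count
```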

GenAttack Algorithm

GenAttack relies on genetic algorithms, which are population-based, gradient-free optimization strategies. Genetic algorithms are inspired by the process of natural selection, iteratively evolving a population of candidate solutions towards better ones. The population in each iteration is called a generation. In each generation, the quality of population members is evaluated using a fitness function; "fitter" solutions are more likely to be selected for breeding the next generation. The next generation is produced through a combination of crossover and mutation. Crossover is the process of taking more than one parent solution and producing a child solution from them; it is analogous to reproduction and biological crossover. In addition, at each iteration, a small random mutation is applied to population members during evolution, according to a small user-defined mutation probability. This increases the diversity of population members and provides better exploration of the search space.

Algorithm 1 describes the operation of GenAttack. The input to the algorithm is the original image x and the target classification label t chosen by the attacker. The algorithm computes an adversarial image x_adv such that the model classifies x_adv as t and the ℓ∞ distance between x_adv and x is at most δ_max. We define the population size to be N, the mutation probability to be ρ, and the step-size to be α.

GenAttack initializes a population of N examples around the given input example x by applying independent, uniformly distributed random noise in the range (−δ_max, δ_max) to each dimension of the input vector with probability ρ. Then, repeatedly, until a successful example is found, each population member's fitness is evaluated, parents are selected, and crossover and mutation are performed to form the next generation.

  Input: original example x, target label t, maximum distance δ_max, step-size α, mutation probability ρ, population size N.
  for i = 1 … N in population do
     P_i ← x + U(−δ_max, δ_max) {Initialize member with bounded uniform noise.}
  end for
  for g = 1, 2, … do
     for i = 1 … N in population do
        F_i ← ComputeFitness(P_i)
     end for
     x_adv ← P_(argmax_i F_i) {Current best (elite) member.}
     if argmax_c f(x_adv)_c = t then
        Return: x_adv { Found successful attack}
     end if
     probs ← Normalize(F) {Selection probabilities from fitness values.}
     next_1 ← x_adv {Elite member survives to the next generation.}
     for i = 2 … N in population do
        parent_1, parent_2 ← Sample(P, probs)
        child ← Crossover(parent_1, parent_2)
        child ← child + Bernoulli(ρ) ⊙ U(−α·δ_max, α·δ_max) {Apply mutations.}
        next_i ← Clip_(x, δ_max)(child) {Add mutated child to next generation}.
     end for
     P ← next
  end for
Algorithm 1 GenAttack (Targeted case)

The subroutine ComputeFitness evaluates the fitness, i.e. the quality, of each population member. As the fitness function should reflect the optimization objective, a reasonable choice would be to use the output score given to the target class label directly. However, we find it more efficient to also jointly reward a decrease in the probability of the other classes. We also find that the use of log proves helpful in avoiding numerical instability issues. Therefore, we pick the following function:

$$\text{ComputeFitness}(x) = \log f(x)_t - \log \sum_{j \neq t} f(x)_j$$
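A direct NumPy translation of this fitness function might look as follows; model_scores is an assumed black-box query function returning the vector of prediction scores.

```python
import numpy as np

def compute_fitness(x, target, model_scores, eps=1e-30):
    """Fitness of a candidate x: reward probability mass on the target class
    while penalizing the mass left on all other classes, in log space."""
    probs = model_scores(x)                          # one black-box query
    other = np.sum(np.delete(probs, target))
    return np.log(probs[target] + eps) - np.log(other + eps)
```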

Population members at each iteration are ranked according to their fitness value. Members with higher fitness are more likely to be part of the next generation, while members with lower fitness are more likely to be replaced. We compute the probability of selection for each population member by normalizing the fitness values into a probability distribution, and then stochastically and independently select random parent pairs among the population members according to that distribution. In addition, the elite member, the one with the highest fitness, is guaranteed to become a member of the next generation.

After selection, parents are mated together to produce members of the next generation. A child is generated by selecting each feature value from either parent_1 or parent_2 according to their selection probabilities:

$$\text{child}_j = \begin{cases} (\text{parent}_1)_j & \text{with probability } \frac{p_1}{p_1 + p_2} \\ (\text{parent}_2)_j & \text{otherwise} \end{cases}$$

To encourage diversity among the population members and promote exploration of the search space, at the end of each iteration population members can be subject to mutation, according to probability ρ: random noise uniformly sampled in the range (−α·δ_max, α·δ_max) is applied to individual features of the chosen population member. Finally, clipping is performed to ensure that the pixel values stay within the permissible ℓ∞ distance δ_max of the benign example x. The sketch after this paragraph puts these steps together.
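The sketch below shows one way the whole evolutionary loop could be implemented in NumPy. It is a simplified illustration under the same assumptions as above (an assumed model_scores black-box oracle accepting a flattened image, mutation noise scaled by the step-size α, fitness-proportional selection via a softmax, and illustrative default parameter values), not the authors' released code.

```python
import numpy as np

def gen_attack(x, target, model_scores, delta_max=0.05, alpha=1.0,
               rho=0.05, pop_size=6, max_generations=100000, eps=1e-30):
    """Gradient-free targeted attack via a genetic algorithm (sketch)."""
    x = np.asarray(x, dtype=float).ravel()            # work on the flattened image
    d = x.size
    # Initialize the population with bounded uniform noise around x.
    pop = np.clip(x + np.random.uniform(-delta_max, delta_max,
                                        size=(pop_size, d)), 0.0, 1.0)

    def fitness(candidate):
        probs = model_scores(candidate)               # one black-box query
        other = np.sum(np.delete(probs, target))
        return np.log(probs[target] + eps) - np.log(other + eps)

    for _ in range(max_generations):
        scores = np.array([fitness(p) for p in pop])
        elite = pop[np.argmax(scores)]
        if np.argmax(model_scores(elite)) == target:
            return elite                              # successful adversarial example

        # Fitness-proportional selection probabilities (softmax for stability).
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()

        next_pop = [elite]                            # elitism
        for _ in range(pop_size - 1):
            i, j = np.random.choice(pop_size, size=2, p=probs)
            # Crossover: take each feature from one of the two parents.
            p_i = probs[i] / (probs[i] + probs[j])
            mask = np.random.rand(d) < p_i
            child = np.where(mask, pop[i], pop[j])
            # Mutation: perturb a random subset of features.
            mut_mask = np.random.rand(d) < rho
            child = child + mut_mask * np.random.uniform(-alpha * delta_max,
                                                         alpha * delta_max, size=d)
            # Project back into the feasible region around x.
            child = np.clip(child, x - delta_max, x + delta_max)
            next_pop.append(np.clip(child, 0.0, 1.0))
        pop = np.array(next_pop)
    return None                                       # attack failed within budget
```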

Figure 1: MNIST adversarial examples generated by GenAttack. Row label is the true label and column label is the target label.
Figure 2: CIFAR-10 adversarial examples generated by GenAttack. Row label is the true label and column label is the target label.

Results

We evaluate GenAttack by attacking state-of-the-art MNIST, CIFAR-10, and ImageNet image classification models. For each dataset, we use the same models as the ZOO work [Chen et al.2017b]. For MNIST and CIFAR-10, the model accuracies are 99.5% and 80%, respectively; the reader can refer to [Carlini and Wagner2017] for details on the architecture of those models. For ImageNet, we use Inception-v3 [Szegedy et al.2015], which achieves 94.4% top-5 accuracy and 78.8% top-1 accuracy. We compare the effectiveness of GenAttack to ZOO on these models in terms of the attack success rate (ASR), the runtime, and the median number of queries necessary for success. The runtime and query count statistics are computed over successful attacks only. A single query is one evaluation of the target model's output on a single input image. Using the authors' code (https://github.com/huanzhang12/ZOO-Attack), we configure ZOO for each dataset based on the implementations the authors used for generating their experimental results [Chen et al.2017b]. We also evaluate against the state-of-the-art white-box C&W attack, assuming direct access to the model, to give perspective on runtime.

In addition, we evaluate the effectiveness of GenAttack against ensemble adversarial training [Tramèr et al.2017], using models released by the authors (https://github.com/tensorflow/models/tree/master/research/adv_imagenet_models). Ensemble adversarial training is considered the state-of-the-art ImageNet defense against black-box attacks, proven effective at providing robustness against transfer attacks during the NIPS 2017 Competition on Defenses against Adversarial Attacks [Tramèr et al.2017, Kurakin et al.2018]. Finally, we evaluate against recently proposed randomized, non-differentiable input transformation defenses [Guo, Rana, and van der Maaten2017] to test GenAttack's performance against gradient obfuscation. We find that GenAttack can handle such defenses as-is due to its gradient-free nature.

Hyperparameters

For all of our MNIST and CIFAR-10 experiments, we limit GenAttack to a maximum of 100,000 queries and fix the hyperparameters: the mutation probability ρ, the population size N, and the step-size α. For all of our ImageNet experiments, as the images are nearly 100x larger than those of CIFAR-10, we use a maximum of 1,000,000 queries and scale the hyperparameters down accordingly. To match the mean distortion computed over successful examples of ZOO, and thereby make the query comparison fair, we set the maximum distortion δ_max separately for our MNIST, CIFAR-10, and ImageNet experiments. To encourage further work and easily enable reproducibility, we are releasing our code as open source upon acceptance (https://github.com/nesl/adversarial_genattack.git).

Query Comparison

We compare GenAttack and ZOO by the number of queries necessary to achieve success, and provide C&W white-box results to put the runtime in perspective. For all experiments, we use an AMD Threadripper 1950X CPU with a single NVIDIA GTX 1080 Ti GPU. For MNIST, CIFAR-10, and ImageNet, we use 1,000, 1,000, and 100 randomly selected, correctly classified images from the test sets, respectively. For each image, we select a random target label. Table 1 shows the results of our experiment. Both ZOO and GenAttack succeed on the MNIST and CIFAR-10 datasets; however, GenAttack is 2,126 times and 2,568 times more query-efficient, respectively. Furthermore, unlike ZOO, GenAttack is actually slightly faster than C&W, a white-box attack, on MNIST and CIFAR-10. On ImageNet, ZOO is not able to succeed consistently in the targeted case, remains quite query-inefficient, and has exceptional computational cost (see runtime; each iteration, ZOO performs 2x128 queries at once, which would not be possible against a real-world system, where queries must be issued iteratively, so its runtime statistics are artificially low). This is significant as it shows that, unlike current black-box approaches, GenAttack is efficient enough to effectively scale to ImageNet.

A randomly selected set of MNIST and CIFAR-10 test images and their associated adversarial examples, targeted to each other label, are shown in Figure 1 and Figure 2. An ImageNet test image with its associated adversarial example is shown in Figure 3.

Attack    | MNIST ASR | MNIST Queries | MNIST Runtime | CIFAR-10 ASR | CIFAR-10 Queries | CIFAR-10 Runtime | ImageNet ASR | ImageNet Queries | ImageNet Runtime
C&W       | 100%      | –             | 0.006 hr      | 100%         | –                | 0.006 hr         | 100%         | –                | 0.025 hr
ZOO       | 98%       | 2,118,222     | 0.013 hr      | 93.3%        | 2,064,798        | 0.025 hr         | 18%          | 2,611,456        | 2.25 hr
GenAttack | 100%      | 996           | 0.002 hr      | 96.5%        | 804              | 0.001 hr         | 100%         | 97,493           | 0.51 hr
Table 1: Attack success rate (ASR), median number of queries, and mean runtime for the C&W (white-box) attack, ZOO, and GenAttack with equivalent distortion. Query and runtime statistics are computed only over successful examples. The number of queries is not a concern for C&W because it is a white-box attack.
Figure 3: Adversarial example generated by GenAttack against the Inception-v3 model (δ_max = 0.05). Left: original; right: adversarial example.
Figure 4: Adversarial example generated by GenAttack against the JPEG compression defense. Left: original; right: adversarial example.
Attack    | Inception-v3 ASR | Inception-v3 Queries | Ens4AdvInceptionV3 ASR | Ens4AdvInceptionV3 Queries
GenAttack | 100%             | 97,493               | 93%                    | 163,995
ZOO       | 18%              | 2,611,456            | 6%                     | 3,584,623
C&W       | 100%             | –                    | 100%                   | –
Table 2: Comparison of GenAttack results (with δ_max = 0.05) against ZOO and C&W (white-box) in attacking the vanilla and ensemble adversarially trained Inception-v3 models. Query counts are computed over successful examples.

Attacking Ensemble Adversarial Training

Ensemble adversarial training incorporates adversarial inputs generated on other already trained models into the model’s training data in order to increase its robustness to adversarial examples [Tramèr et al.2017]. This has proven to be the most effective approach at providing robustness against transfer-based black-box attacks during the NIPS 2017 Competition. We demonstrate that the defense is much less robust against query-efficient black-box attacks, such as GenAttack.

We performed an experiment to evaluate the effectiveness of GenAttack against the ensemble adversarially trained models released by the authors, namely Ens4AdvInceptionV3 and EnsAdvInceptionResNetv2. We use the same 100 randomly sampled test images and targets as in our previous ImageNet experiments. We find that GenAttack achieves 93% and 88% success against the two models, respectively, significantly outperforming ZOO. In Table 2, we compare the success rate and median query count between the ensemble adversarially trained and the vanilla Inception-v3 models. The comparison shows that these positive results come with only a limited increase in query count. We additionally note that the maximum perturbation used for evaluation in the NIPS 2017 competition varied between 4 and 16 (on pixel values in [0, 255]), which, when normalized, equals roughly 0.02 and 0.06; our δ_max (0.05) falls in this range.

Transformation | CIFAR-10 ASR | CIFAR-10 Queries | ImageNet ASR | ImageNet Queries
Bit depth      | 93%          | 2,796            | 100%         | 116,739
JPEG           | 88%          | 3,541            | 89%          | 190,680
TVM            | 70%          | 5,888 x 32       | –            | –
Table 3: Evaluation of GenAttack against non-differentiable and randomized input transformation defenses. Different maximum distortions δ_max were used for the bit-depth experiments and for the JPEG and TVM experiments.

Attacking Non-Differentiable, Randomized Input Transformations

Non-differentiable input transformations perform gradient obfuscation, relying on gradient manipulation to defeat gradient-based attackers [Athalye, Carlini, and Wagner2018]. In addition, randomized transformations make it harder for the attacker to guarantee success. One can circumvent such approaches by modifying the core defense module that performs the gradient obfuscation; however, this is clearly not applicable in the black-box case.

In [Guo, Rana, and van der Maaten2017], a number of input transformations were explored, including bit-depth reduction, JPEG compression, and total variance minimization. Bit-depth reduction and JPEG compression are non-differentiable, while total variance minimization additionally introduces randomization and is quite slow, making it difficult to iterate upon. We demonstrate that GenAttack can succeed against these input transformations in the black-box case, as its gradient-free nature makes it impervious to gradient obfuscation. To the best of our knowledge, we are the first to demonstrate a black-box algorithm which can bypass such defenses. Our results are summarized in Table 3.

For bit-depth reduction, images were quantized to 3 bits, while for JPEG compression, the quality level was set to 75, as in [Guo, Rana, and van der Maaten2017]. GenAttack achieves a high success rate against both non-differentiable transformations on both the CIFAR-10 and ImageNet datasets. A visual example of our results against JPEG compression is shown in Figure 4.
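For intuition, bit-depth reduction is a one-line, non-differentiable preprocessing step that, from the attacker's perspective, simply sits inside the black box before the classifier. A minimal sketch (assuming inputs in [0, 1] and 3-bit quantization; model_scores is again an assumed oracle) is shown below.

```python
import numpy as np

def reduce_bit_depth(x, bits=3):
    """Quantize pixel values in [0, 1] to the given bit depth.
    The rounding step has zero gradient almost everywhere, which is what
    breaks gradient-based attackers but does not affect GenAttack."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def defended_scores(x, model_scores, bits=3):
    """Black-box pipeline: transform, then classify."""
    return model_scores(reduce_bit_depth(x, bits))
```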

Total variance minimization (TVM) introduces an additional challenge as it is not only non-differentiable, but it also introduces randomization and is an exceedingly slow procedure. TVM randomly drops many of the pixels (dropout rate of 50%, as in  [Guo, Rana, and van der Maaten2017]) in the original image and reconstructs the input image from the remaining pixels by solving a denoising optimization problem. Due to randomization, the classifier returns a different score at each run for the same input, confusing the attacker. Succeeding against randomization requires more iterations, but iterating upon the defense is difficult due to the slow speed of the optimization.

However, the ComputeFitness function can be generalized to

$$\text{ComputeFitness}(x) = \mathbb{E}_r\Big[\log f(x; r)_t - \log \sum_{j \neq t} f(x; r)_j\Big]$$

where f(·; r) is the randomization-defended model query function and r is the noise input to the TVM function, so GenAttack can still handle this defense in the black-box case. The expectation is computed by querying the model multiple times for every population member to obtain a robust fitness score, at the cost of an increased number of queries. Due to the computational complexity of applying TVM on each query, we performed the TVM experiment only on the CIFAR-10 dataset and achieved 70% success. Because of the large randomization introduced by TVM, we counted an adversarial example as a success only if it was classified as the target label three times in a row. Notably, TVM significantly decreases the model accuracy on clean inputs (e.g. in our CIFAR-10 experiments, from 80% to 40%) unless the model is re-trained on transformed examples [Guo, Rana, and van der Maaten2017].
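A sketch of this generalized fitness, under the same assumptions as the earlier fitness sketch plus an assumed num_samples averaging parameter, could look like this:

```python
import numpy as np

def compute_fitness_randomized(x, target, model_scores, num_samples=32, eps=1e-30):
    """Fitness against a randomized defense (e.g. TVM): average the log-odds
    fitness over several queries, since fresh random noise is drawn inside
    the defended model on every query. `model_scores` is the assumed oracle."""
    values = []
    for _ in range(num_samples):
        probs = model_scores(x)                       # randomness re-drawn per query
        other = np.sum(np.delete(probs, target))
        values.append(np.log(probs[target] + eps) - np.log(other + eps))
    return np.mean(values)
```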

Comparison to ZOO and C&W:

Due to the non-differentiable nature of the input transformations, the C&W attack, a gradient-based attack, cannot succeed without handling the non-differentiable component, as discussed in [Athalye, Carlini, and Wagner2018]. In the white-box case, the backward-pass substitution method described earlier can be applied to yield a high success rate, but it is not applicable in the more restricted black-box case. In the black-box setting, ZOO achieved 8% and 0% success against the non-differentiable bit-depth reduction and JPEG compression defenses on ImageNet, respectively, again demonstrating its impracticality.

Conclusion

GenAttack is a powerful and efficient black-box attack that uses a gradient-free optimization scheme, adopting a population-based approach built on genetic algorithms. We evaluated GenAttack by attacking well-trained MNIST, CIFAR-10, and ImageNet models, and found that it performs successful targeted black-box attacks against these models with significantly fewer queries than the current state-of-the-art; moreover, it succeeds on ImageNet, to which current approaches are incapable of scaling. Furthermore, we demonstrated that GenAttack can succeed against ensemble adversarial training, the state-of-the-art ImageNet defense, with only a limited increase in queries. Finally, we showed that GenAttack, due to its gradient-free nature, can succeed against gradient obfuscation, namely non-differentiable input transformations, and can even succeed against randomized ones by generalizing the fitness function to compute an expectation over the transformation. To the best of our knowledge, this is the first demonstration of a black-box attack which can succeed against these state-of-the-art defenses. Our results suggest that population-based optimization opens up a promising research direction into effective gradient-free black-box attacks.

References