1 Introduction
In recent years, “machine learning as a service” has offered effortless access to powerful machine learning tools for a wide variety of tasks. For example, commercially available services such as Google Cloud Vision API and Clarifai.com provide well-trained image classifiers to the public. One is able to upload images and obtain class prediction results at a low price. However, existing and emerging machine learning platforms and their low model-access costs raise ever-increasing security concerns, as they also offer an ideal environment for testing malicious attempts. Even worse, the risks can be amplified when these services are used to build derived products, since the inherent security vulnerability could then be leveraged by attackers.
In many computer vision tasks, DNN models achieve the state-of-the-art prediction accuracy and hence are widely deployed in modern machine learning services. Nonetheless, recent studies have highlighted DNNs’ vulnerability to adversarial perturbations. In the white-box setting, in which the target model is entirely transparent to an attacker, visually imperceptible adversarial images can be easily crafted to fool a target DNN model towards misclassification by leveraging the input gradient information [Szegedy et al. 2014, Goodfellow, Shlens, and Szegedy 2015]. However, in the black-box setting, in which the parameters of the deployed model are hidden and one can only observe the input-output correspondences of queried examples, crafting adversarial examples requires a gradient-free (zeroth-order) optimization approach to gather the necessary attack information. Figure 1 displays a prediction-evasive adversarial example crafted via iterative model queries to a black-box DNN (the Inception-v3 model [Szegedy et al. 2016]) trained on ImageNet.
Albeit achieving remarkable attack effectiveness through gradient estimation, current black-box attack methods such as [Chen et al. 2017, Nitin Bhagoji et al. 2018] are not query-efficient, since they rely on coordinate-wise gradient estimation and value updates, which inevitably incur an excessive number of model queries and may give a false sense of model robustness due to inefficient query designs. In this paper, we propose to tackle this problem with AutoZOOM, an Autoencoder-based Zeroth Order Optimization Method. AutoZOOM has two novel building blocks: (i) a new and adaptive random gradient estimation strategy to balance query counts and distortion when crafting adversarial examples, and (ii) an autoencoder that is either trained offline on other unlabeled data or based on a simple bilinear resizing operation, in order to accelerate black-box attacks. As illustrated in Figure 2, AutoZOOM utilizes a “decoder” to craft a high-dimensional adversarial perturbation from a (learned) low-dimensional latent-space representation, and its query efficiency can be explained by the dimension-dependent convergence rate of gradient-free optimization.
Contributions. We summarize our main contributions and new insights on adversarial robustness as follows:

We propose AutoZOOM, a novel query-efficient black-box attack framework for generating adversarial examples. AutoZOOM features an adaptive random gradient estimation strategy and dimension reduction techniques (either an offline-trained autoencoder or a bilinear resizer) to reduce attack query counts while maintaining attack effectiveness and visual similarity. To the best of our knowledge, AutoZOOM is the first black-box attack using random full gradient estimation and data-driven acceleration.

We use the convergence rate of zeroth-order optimization to motivate the query efficiency of AutoZOOM, and provide an error analysis of AutoZOOM’s new gradient estimator relative to the true gradient, characterizing the trade-off between estimation error and query counts.

When applied to a state-of-the-art black-box attack proposed in [Chen et al. 2017], AutoZOOM attains a similar attack success rate while achieving a significant reduction (at least 93%) in the mean query counts required to attack the DNN image classifiers for MNIST, CIFAR-10 and ImageNet. It can also fine-tune the distortion in the post-success stage by performing finer gradient estimation.

In the experiments, we also find that AutoZOOM with a simple bilinear resizer as the decoder (AutoZOOM-BiLIN) attains noticeable query efficiency, although it is still outperformed by AutoZOOM with an offline-trained autoencoder (AutoZOOM-AE). However, AutoZOOM-BiLIN is easier to mount, as no additional training is required. The results also suggest an interesting finding: while learning effective low-dimensional representations of legitimate images remains a challenging task, black-box attacks using significantly fewer degrees of freedom (i.e., reduced dimensions) are certainly plausible.
2 Related Work
Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. As a first attempt towards black-box attacks, the authors in [Papernot et al. 2017] proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and implement transfer attacks to the target model [Papernot, McDaniel, and Goodfellow 2016, Liu et al. 2017]. However, its attack performance can be severely degraded due to poor attack transferability [Su et al. 2018]. Although ZOO achieves a similar attack success rate and comparable visual quality as many white-box attack methods [Chen et al. 2017], its coordinate-wise gradient estimation requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in [Nitin Bhagoji et al. 2018].
Beyond optimization-based approaches, the authors in [Ilyas et al. 2018] proposed to use a natural evolution strategy (NES) to enhance query efficiency. Although there is a vector-wise gradient estimation step in the NES attack, we treat it as a parallel work, since its natural evolutionary step is out of the scope of black-box attacks using zeroth-order gradient descent. We also note that, different from NES, our AutoZOOM framework uses a theory-driven, query-efficient, random-vector based gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account attack dimension reduction, which is the novelty of AutoZOOM as well as the main focus of this paper.
Under a more restricted attack setting, where only the decision (top-1 predicted class) is known to an attacker, the authors in [Brendel, Rauber, and Bethge 2018] proposed a random-walk based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries. Due to space limitations, we provide more background and a table comparing existing black-box attacks in the supplementary material.
3 AutoZOOM: Background and Methods
3.1 Black-box Attack Formulation and Zeroth-Order Optimization
Throughout this paper, we focus on improving the query efficiency of gradient-estimation and gradient-descent based black-box attacks empowered by AutoZOOM, and we consider the threat model in which the class prediction scores are known to an attacker. In this setting, it suffices to denote the target DNN as a classification function F that takes a d-dimensional scaled image x ∈ [0, 1]^d as its input and yields a vector of prediction scores of all K image classes, such as the prediction probabilities for each class. We further consider the case of applying an entry-wise monotonic transformation M(·) to the output of F for black-box attacks, since a monotonic transformation preserves the ranking of the class predictions and can alleviate the problem of large score variation in F (e.g., probability to log probability).
Here we formulate black-box targeted attacks; the formulation can be easily adapted to untargeted attacks. Let x₀ denote a natural image with ground-truth class label t₀, and let x denote the adversarial example of x₀ with target attack class label t ≠ t₀. The problem of finding an adversarial example can be formulated as an optimization problem taking the generic form of
minimize_x  Dist(x, x₀) + λ · Loss(x, t)   subject to  x ∈ [0, 1]^d    (1)
where Dist(x, x₀) measures the distortion between x and x₀, Loss(x, t) is an attack objective reflecting the likelihood of predicting t, λ is a regularization coefficient, and the constraint x ∈ [0, 1]^d confines the adversarial image to the valid image space. The distortion is often evaluated by the L_p norm, defined as ‖δ‖_p = (Σᵢ |δᵢ|^p)^{1/p} for p ≥ 1, where δ = x − x₀ is the adversarial perturbation to x₀. The attack objective Loss can be the training loss of DNNs [Goodfellow, Shlens, and Szegedy 2015] or some designed loss based on model predictions [Carlini and Wagner 2017b].
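As a concrete illustration of the distortion measure, here is a minimal numpy sketch (the function names are ours, not from the paper) of the L_p norm of a perturbation and of the squared L2 distortion used later in the experiments:

```python
import numpy as np

def lp_distortion(x_adv, x0, p=2):
    """L_p norm of the adversarial perturbation delta = x_adv - x0."""
    delta = (x_adv - x0).ravel()
    return float(np.sum(np.abs(delta) ** p) ** (1.0 / p))

def squared_l2_distortion(x_adv, x0):
    """Squared L2 distortion, the Dist(...) used in the ZOO-style objective."""
    return float(np.sum((x_adv - x0) ** 2))
```

For a perturbation (3, 4, 0, 0), the L2 distortion is 5 and the squared L2 distortion is 25.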
In the white-box setting, an adversarial example is generated by using downstream optimizers such as ADAM [Kingma and Ba 2015] to solve (1); this requires the gradient of the objective function with respect to the input x, obtained via backpropagation through the DNN. However, in the black-box setting, acquiring this gradient is infeasible, and one can only obtain the function evaluations, which renders solving (1) a zeroth-order optimization problem. Recently, zeroth-order optimization approaches [Ghadimi and Lan 2013, Nesterov and Spokoiny 2017, Liu et al. 2018] have circumvented this challenge by approximating the true gradient via function evaluations. Specifically, in black-box attacks, the gradient estimate is applied to both the gradient computation and the descent steps in the optimization process for solving (1).
3.2 Random Vector based Gradient Estimation
As a first attempt to enable gradient-free black-box attacks on DNNs, the authors in [Chen et al. 2017] use the symmetric difference quotient method [Lax and Terrell 2014] to evaluate the i-th component of the gradient of a function f by

∂f(x)/∂xᵢ ≈ ( f(x + h·eᵢ) − f(x − h·eᵢ) ) / (2h)    (2)
using a small h. Here eᵢ denotes the i-th elementary basis vector. Albeit contributing to powerful black-box attacks applicable even to large networks trained on ImageNet, the coordinate-wise gradient estimation step in (2) must incur an enormous number of model queries and is hence not query-efficient. For example, an ImageNet input has roughly 270,000 dimensions, rendering coordinate-wise zeroth-order optimization based on gradient estimation query-inefficient.
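The symmetric difference quotient of (2) can be sketched in a few lines of numpy; this toy version (hypothetical names, plain Python loop) also counts the 2d queries per full gradient that make the approach so expensive:

```python
import numpy as np

def coordinatewise_grad(f, x, h=1e-4):
    """Estimate the full gradient of f at x via the symmetric difference
    quotient, one coordinate at a time; costs 2*d function queries."""
    d = x.size
    grad = np.zeros(d)
    queries = 0
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0                      # i-th elementary basis vector
        grad[i] = (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)
        queries += 2                      # two model evaluations per coordinate
    return grad, queries
```

On f(x) = ‖x‖² at x = (1, 2), this returns a gradient close to (2, 4) at the cost of 4 queries; at ImageNet scale the same full estimate would cost hundreds of thousands of queries per descent step.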
To improve query efficiency, we dispense with coordinate-wise estimation and instead propose a scaled random full gradient estimator of ∇f(x), defined as

g = b · ( f(x + β·u) − f(x) ) / β · u    (3)
where β > 0 is a smoothing parameter, u is a unit-length vector drawn uniformly at random from the unit Euclidean sphere, and b is a tunable scaling parameter that balances the bias and variance trade-off of the gradient estimation error. Note that particular choices of b recover the gradient estimators used in [Duchi et al. 2015] and in [Gao, Jiang, and Zhang 2014]. We will provide an optimal value of b balancing query efficiency and estimation error in the following analysis.

Averaged random gradient estimation. To effectively control the error in gradient estimation, we consider a more general gradient estimator, in which the gradient estimate is averaged over q random directions u₁, …, u_q. That is,
ḡ = (1/q) · Σ_{j=1}^{q} g_j    (4)

where g_j is a gradient estimate defined in (3) with u = u_j. The use of multiple random directions can reduce the variance of ḡ in (4) for convex loss functions [Duchi et al. 2015, Liu et al. 2018]. Below we establish an error analysis of the averaged random gradient estimator in (4) for studying the influence of the parameters b, q, and β on estimation error and query efficiency.
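A minimal numpy sketch of the scaled single-direction estimator and its q-direction average (the names and defaults are ours; u is drawn uniformly from the unit sphere by normalizing a Gaussian vector):

```python
import numpy as np

def random_grad_estimate(f, x, b, beta=1e-5, rng=None):
    """One-query scaled random gradient estimate:
    g = b * (f(x + beta*u) - f(x)) / beta * u, with u uniform on the sphere."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)                  # unit-length random direction
    return b * (f(x + beta * u) - f(x)) / beta * u

def averaged_grad_estimate(f, x, q, b, beta=1e-5, rng=None):
    """Average q independent single-direction estimates to reduce variance."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return np.mean(
        [random_grad_estimate(f, x, b, beta, rng) for _ in range(q)], axis=0)
```

For a smooth test function such as f(x) = ‖x‖², the averaged estimate with b = d concentrates around the true gradient 2x as q grows, using only q + 1 distinct function values per estimate.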
Theorem 1.
Assume f is differentiable and its gradient ∇f is L-Lipschitz (a function g is L-Lipschitz if ‖g(x) − g(y)‖₂ ≤ L‖x − y‖₂ for any x, y; for DNNs with ReLU activations, L can be derived from the model weights [Szegedy et al. 2014]). Then the mean squared estimation error of ḡ in (4) is upper bounded by

E‖ḡ − ∇f(x)‖₂² ≤ 4η‖∇f(x)‖₂² + O(β²L²),  where η = b²(d − 1)/(d²q) + (b − d)²/d².    (5)
Proof.
The proof is given in the supplementary file. ∎
Here we highlight the important implications of Theorem 1: (i) the error analysis holds even when f is nonconvex; (ii) in DNNs, the true gradient ∇f can be viewed as the numerical gradient obtained via backpropagation; (iii) for any fixed q, selecting a sufficiently small β (as done in AutoZOOM) can effectively reduce the last error term in (5), and we therefore focus on optimizing the first error term; (iv) the first error term in (5) exhibits the influence of b and q on the estimation error and is independent of β. We further elaborate on (iv) as follows. Fixing d and minimizing the coefficient η of the first error term over b, the optimal value is b* = dq/(d + q − 1). For query efficiency, one would like to keep q small, which implies b* ≈ q and η ≈ 1 when the dimension d is large. On the other hand, when q = d, we have b* ≈ d/2 and η ≈ 1/2, which yields a smaller error upper bound but is query-inefficient. We also note that by setting b = q, the coefficient η < 1, and the first error term is thus bounded independently of the dimension d and the parameter β.
Adaptive random gradient estimation. Based on Theorem 1 and our error analysis, in AutoZOOM we set b = q in (3) and propose an adaptive strategy for selecting q. AutoZOOM uses q = 1 (i.e., the fewest possible model evaluations) to first obtain rough gradient estimates for solving (1) until a successful adversarial image is found. After the initial attack success, it switches to more accurate gradient estimates with q > 1 to fine-tune the image quality. The trade-off between q (which is proportional to query counts) and distortion reduction will be investigated in Section 4.
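The adaptive schedule can be sketched as a toy zeroth-order descent loop (illustrative only: the names, step size, and the b = q coupling below are our assumptions for the sketch, not the paper's exact algorithm):

```python
import numpy as np

def estimate_grad(f, x, q, beta, rng):
    """Averaged random gradient estimate with the scaling tied to the
    number of directions (b = q), one choice suggested by the error analysis."""
    g = np.zeros_like(x)
    for _ in range(q):
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)              # u uniform on the unit sphere
        g += q * (f(x + beta * u) - f(x)) / beta * u
    return g / q

def adaptive_descent(f, x0, is_success, steps=200, lr=0.05, q_post=4, beta=1e-4):
    """Use q = 1 rough estimates until the first success, then switch to
    q = q_post to refine (a stand-in for AutoZOOM's post-success fine-tuning)."""
    rng = np.random.default_rng(0)
    x, succeeded = x0.astype(float).copy(), False
    for _ in range(steps):
        q = q_post if succeeded else 1      # adaptive choice of q
        x = x - lr * estimate_grad(f, x, q, beta, rng)
        succeeded = succeeded or is_success(x)
    return x
```

On a toy quadratic loss the loop first makes cheap single-query progress, then spends q_post + 1 queries per step once the "success" condition is met.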
3.3 Attack Dimension Reduction via Autoencoder
Dimension-dependent convergence rate using gradient estimation. Different from first-order convergence results, the convergence rate of zeroth-order gradient descent methods carries an additional multiplicative dimension-dependent factor. In the convex loss setting the rate is O(√d/√T), where d is the dimension and T is the number of iterations [Nesterov and Spokoiny 2017, Liu et al. 2018, Gao, Jiang, and Zhang 2014, Wang et al. 2018]. The same convergence rate has also been found in the nonconvex setting [Ghadimi and Lan 2013]. This dimension-dependent factor suggests that vanilla black-box attacks using gradient estimation can be query-inefficient when the (vectorized) image dimension d is large, due to the curse of dimensionality in convergence. This also motivates us to use an autoencoder to reduce the attack dimension and improve query efficiency in black-box attacks.
In AutoZOOM, we propose to perform random gradient estimation from a reduced dimension d′ < d to improve query efficiency. Specifically, as illustrated in Figure 2, the additive perturbation to an image x₀ is implemented through a “decoder” D: ℝ^{d′} → ℝ^d such that x = x₀ + D(δ′), where δ′ ∈ ℝ^{d′}. In other words, the adversarial perturbation to x₀ is in fact generated from a dimension-reduced space, with the aim of improving query efficiency owing to the reduced dimension-dependent factor in the convergence analysis.
AutoZOOM provides two modes for such a decoder D:
An autoencoder (AE) trained on unlabeled data that are different from the training data, to learn reconstruction from a dimension-reduced representation. The encoder of an AE compresses the data to a low-dimensional latent space and the decoder reconstructs an example from its latent representation. The weights of an AE are learned to minimize the average reconstruction error. Note that training such an AE for black-box adversarial attacks is one-time and entirely offline (i.e., no model queries are needed).
A simple channel-wise bilinear image resizer (BiLIN) that scales a small image to a large image via bilinear extrapolation (see, e.g., tf.image.resize_images in TensorFlow). Note that no additional training is required for BiLIN.

Why AE? Our proposal of an AE is motivated by the insightful findings in [Goodfellow, Shlens, and Szegedy 2015] that a successful adversarial perturbation is highly relevant to some human-imperceptible noise pattern resembling the shape of the target class, known as the “shadow”. Since the decoder of an AE learns to reconstruct data from latent representations, it can also provide distributional guidance for mapping adversarial perturbations to generate these shadows.
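For concreteness, a channel-wise bilinear resize of the kind the BiLIN decoder performs can be sketched in plain numpy (a stand-in for tf.image.resize_images; the helper name is ours):

```python
import numpy as np

def bilinear_upscale(small, out_h, out_w):
    """Channel-wise bilinear resize of an (h, w, c) array to (out_h, out_w, c):
    each output pixel blends its four nearest input pixels by their
    fractional distances along each axis."""
    h, w, c = small.shape
    ys = np.linspace(0, h - 1, out_h)           # output rows in input coords
    xs = np.linspace(0, w - 1, out_w)           # output cols in input coords
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]               # vertical blend weights
    wx = (xs - x0)[None, :, None]               # horizontal blend weights
    top = small[y0][:, x0] * (1 - wx) + small[y0][:, x1] * wx
    bot = small[y1][:, x0] * (1 - wx) + small[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

A low-dimensional perturbation δ′ of shape (h, w, c) can thus be mapped to the full image dimension without any learned parameters, which is exactly why BiLIN needs no training.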
We also note that for any reduced dimension d′, the parameter setting derived from Theorem 1 remains optimal in terms of minimizing the corresponding estimation error, despite the fact that the gradient estimation errors of different reduced dimensions cannot be directly compared. In Section 4 we report the superior query efficiency of black-box attacks achieved with either the AE or the BiLIN decoder, and discuss the benefit of attack dimension reduction.
3.4 AutoZOOM Algorithm
Algorithm 1 summarizes the AutoZOOM framework for query-efficient black-box attacks on DNNs. We note that AutoZOOM is a general acceleration tool that is compatible with any gradient-estimation based black-box adversarial attack obeying the attack formulation in (1). It also provides theoretical estimation-error guarantees and query-efficient parameter selection based on Theorem 1. The details of adjusting the regularization coefficient λ and the query parameter q based on run-time model evaluation results are discussed in Section 4. Our source code is publicly available (https://github.com/IBM/Autozoom-Attack).
4 Performance Evaluation
This section presents experiments assessing the performance of AutoZOOM in accelerating black-box attacks on DNNs, in terms of the number of queries required for an initial attack success and for reaching a specific distortion level.
4.1 Distortion Measure and Attack Objective
As described in Section 3, AutoZOOM is a query-efficient gradient-free optimization framework for solving the black-box attack formulation in (1). In the following experiments, we demonstrate the utility of AutoZOOM by using the same attack formulation proposed in ZOO [Chen et al. 2017], which uses the squared L2 norm as the distortion measure and adopts the attack objective
Loss(x, t) = max{ max_{j ≠ t} log [F(x)]_j − log [F(x)]_t, 0 }    (6)
where this hinge function is designed for targeted black-box attacks on the DNN model F, and the monotonic transformation M(·) = log(·) is applied to the model output.
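The hinge-style objective can be sketched directly (a hypothetical helper; `scores` plays the role of the prediction probabilities F(x), and log is the monotonic transformation):

```python
import numpy as np

def targeted_hinge_loss(scores, t):
    """Hinge-style targeted attack objective on log-transformed scores:
    max( max_{j != t} log F(x)_j - log F(x)_t , 0 ).
    The loss is zero exactly when the target class t has the top score."""
    log_p = np.log(scores)
    other = np.max(np.delete(log_p, t))   # best non-target log score
    return max(other - log_p[t], 0.0)
```

For scores (0.1, 0.6, 0.3), the loss is 0 when targeting class 1 (already top-ranked) and log(0.6/0.1) = log 6 when targeting class 0.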
4.2 Comparative Black-box Attack Methods
We compare AutoZOOM-AE (D = AE) and AutoZOOM-BiLIN (D = BiLIN) with two baselines: (i) the standard ZOO implementation (https://github.com/huanzhang12/ZOO-Attack) with bilinear scaling (same as BiLIN) for dimension reduction; and (ii) ZOO+AE, which is ZOO with an AE. Note that all attacks thus generate adversarial perturbations from the same reduced attack dimension.
4.3 Experiment Setup, Evaluation, Datasets and AutoZOOM Implementation
We assess the performance of different attack methods on several representative benchmark datasets: MNIST [LeCun et al. 1998], CIFAR-10 [Krizhevsky 2009] and ImageNet [Russakovsky et al. 2015]. For MNIST and CIFAR-10, we use the same DNN image classification models (https://github.com/carlini/nn_robust_attacks) as in [Carlini and Wagner 2017b]. For ImageNet, we use the Inception-v3 model [Szegedy et al. 2016]. All experiments were conducted using the TensorFlow machine-learning library [Abadi et al.] on machines equipped with an Intel Xeon E5-2690v3 CPU and an NVIDIA Tesla K80 GPU.
All attacks used ADAM [Kingma and Ba 2015] to solve (1) with their estimated gradients and the same initial learning rate. On MNIST and CIFAR-10, all methods run 1,000 ADAM iterations. On ImageNet, ZOO and ZOO+AE run 20,000 iterations, whereas AutoZOOM-BiLIN and AutoZOOM-AE run 100,000 iterations. Note that, due to the different gradient estimation methods, the query counts (i.e., the number of model evaluations) per iteration of a black-box attack may vary. ZOO and ZOO+AE use the parallel gradient update of (2) with a batch of 128 pixels, yielding 256 query counts per iteration. AutoZOOM-BiLIN and AutoZOOM-AE use the averaged random full gradient estimator in (4), resulting in q + 1 query counts per iteration. For a fair comparison, the query counts are used for performance assessment.
Query reduction ratio. We use the mean query counts of ZOO with the smallest initial λ as the baseline for computing the query reduction ratio of the other methods and configurations.
TPR and initial success. We report the true positive rate (TPR), which measures the percentage of successful attacks fulfilling a predefined constraint on the normalized (per-pixel) distortion, together with the query counts of their first successes. We also report the per-pixel distortions at initial success, where an initial success refers to the first query that finds a successful adversarial example.
Post-success fine-tuning. When implementing AutoZOOM in Algorithm 1, on MNIST and CIFAR-10 we find that AutoZOOM without fine-tuning (i.e., keeping q = 1) already yields distortion similar to ZOO. We note that ZOO can be viewed as performing coordinate-wise fine-tuning and is thus query-inefficient. On ImageNet, we investigate the effect of post-success fine-tuning on reducing distortion.
Autoencoder training. In AutoZOOM-AE, we use convolutional autoencoders for attack dimension reduction, trained on unlabeled datasets that are different from both the training dataset and the attacked natural examples. The implementation details are given in the supplementary material.
Dynamic switching on λ. To adjust the regularization coefficient λ in (1), all methods set an initial value λ_init on MNIST and CIFAR-10 and another on ImageNet. Furthermore, to balance the distortion Dist and the attack objective Loss in (1), we use a dynamic switching strategy to update λ during the optimization process: every S iterations, λ is multiplied by 10 if the attack has never been successful, and otherwise halved (the value of S differs between MNIST/CIFAR-10 and ImageNet). At the instant of initial success, we also reset λ and the ADAM parameters to their default values, as doing so empirically reduces the distortion for all attack methods.
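The switching rule can be sketched as follows (the clamping bounds are our own safeguard, not from the paper; the schedule period S is handled by the caller):

```python
def update_lambda(lam, ever_succeeded, lam_max=1e10, lam_min=1e-10):
    """One dynamic-switching update of the regularization coefficient:
    multiply lambda by 10 if the attack has never succeeded (push harder
    on the attack loss), otherwise halve it (favor lower distortion)."""
    lam = lam * 10.0 if not ever_succeeded else lam / 2.0
    return min(max(lam, lam_min), lam_max)  # clamp to a sane range
```

Calling this once per S iterations ramps λ up geometrically until the first success, then decays it to trade attack confidence for smaller distortion.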
4.4 Black-box Attacks on MNIST and CIFAR-10
For both MNIST and CIFAR-10, we randomly select 50 correctly classified images from the test sets and perform targeted attacks on them. Since both datasets have 10 classes, each selected image is attacked 9 times, targeting every class but its true one. For all attacks, the ratio of the reduced attack-space dimension to the original one (i.e., d′/d) is 25% for MNIST and 6.25% for CIFAR-10.
Table 1 shows the performance evaluation on MNIST for various values of λ_init, the initial value of the regularization coefficient in (1). We use the performance of ZOO with the smallest λ_init as the baseline for comparison. For example, relative to this baseline, the mean query counts required by AutoZOOM-AE to attain an initial success are reduced by 93.21% and 98.57% for two settings of λ_init, respectively. One can also observe that a larger λ_init generally leads to fewer mean query counts at the price of slightly increased distortion for the initial attack. The notably large difference in the required attack query counts between AutoZOOM and ZOO/ZOO+AE validates the effectiveness of our random full gradient estimator in (3), which dispenses with the coordinate-wise gradient estimation of ZOO while maintaining comparable true positive rates, thereby greatly improving query efficiency.
For CIFAR-10, we observe similar improvements in query efficiency, as displayed in Table 2. In particular, comparing the two query-efficient black-box attack methods, we find that AutoZOOM-AE is more query-efficient than AutoZOOM-BiLIN, albeit at the cost of an additional AE training step. AutoZOOM-AE achieves the highest attack success rates (ASRs) and mean query reduction ratios for different values of λ_init. In addition, the true positive rates (TPRs) of the two methods are similar, but AutoZOOM-AE usually takes fewer query counts to reach the same distortion. We note that in one setting AutoZOOM-AE has a higher TPR yet needs slightly more mean query counts than AutoZOOM-BiLIN to reach the same distortion. This suggests that some adversarial examples whose post-success distortion is difficult for a bilinear resizer to reduce can still be handled by an AE.
4.5 Black-box Attacks on ImageNet
We selected 50 correctly classified images from the ImageNet test set to perform random targeted attacks, setting λ_init as in Section 4.3 and the attack dimension reduction ratio to 1.15%. The results are summarized in Table 3. Note that, compared to ZOO, AutoZOOM-AE significantly reduces the query count required to achieve an initial success, by 99.39% (or 99.35% to reach the same distortion). This is a remarkable improvement, as it amounts to more than 2.2 million fewer model queries, given that the input dimension of ImageNet (≈ 270K) is much larger than that of MNIST and CIFAR-10.
Post-success distortion refinement. As described in Algorithm 1, adaptive random gradient estimation is integrated into AutoZOOM, offering a quick initial attack success followed by a fine-tuning process that effectively reduces the distortion. This is achieved by adjusting the gradient averaging parameter q in (4) in the post-success stage. In general, averaging over more random directions (i.e., setting a larger q) tends to better reduce the variance of the gradient estimation error, but at the cost of more model queries. Figure 3 (a) shows the mean distortion against query counts for various choices of q in the post-success stage. The results suggest that a small q > 1 can further decrease the distortion in the converged phase when compared with the case of q = 1. Moreover, the refinement effect on distortion empirically saturates beyond a modest q, implying a marginal gain from larger values. These findings demonstrate that AutoZOOM indeed strikes a balance between distortion and query efficiency in black-box attacks.
4.6 Dimension Reduction and Query Efficiency
In addition to the motivation from the convergence rate of zeroth-order optimization (Sec. 3.3), as a sanity check, we corroborate the benefit of attack dimension reduction for query efficiency in black-box attacks by comparing AutoZOOM with its alternative operating on the original (non-reduced) dimension (i.e., d′ = d). Tested on all three datasets under the aforementioned settings, Figure 3 (b) shows the corresponding mean query count to initial success and the mean query reduction ratio when dimension reduction is applied. Compared with attacking in the original dimension, attack dimension reduction through AutoZOOM cuts roughly 35–40% of the query counts on MNIST and CIFAR-10 and at least 95% on ImageNet. This result highlights the importance of dimension reduction for query-efficient black-box attacks. For example, without dimension reduction, the attack on the original ImageNet dimension cannot even succeed within the query budget.
4.7 Additional Remarks and Discussion
In addition to benchmarking against initial attack success, the query reduction ratio for reaching the same distortion can be computed directly from the last column of each table.
The attack gain of AutoZOOM-AE over AutoZOOM-BiLIN can sometimes be marginal, and we note that there is room for improving AutoZOOM-AE by exploring different AE models. Nonetheless, we advocate AutoZOOM-BiLIN as a practically ideal candidate for query-efficient black-box attacks when testing model robustness, owing to its easy-to-mount nature and the absence of any additional training cost.
While learning effective low-dimensional representations of legitimate images remains a challenging task, black-box attacks using significantly fewer degrees of freedom (i.e., reduced dimensions), as demonstrated in this paper, are certainly plausible, with new implications for model robustness.
5 Conclusion
AutoZOOM is a generic attack acceleration framework that is compatible with any gradient-estimation based black-box attack having the general formulation in (1). It adopts a new and adaptive random full gradient estimation strategy to strike a balance between query counts and estimation errors, and features a decoder (AE or BiLIN) for attack dimension reduction and acceleration of algorithmic convergence. Compared to a state-of-the-art attack (ZOO), AutoZOOM consistently reduces the mean query counts when attacking black-box DNN image classifiers for MNIST, CIFAR-10 and ImageNet, attaining at least 93% query reduction in finding initial successful adversarial examples (or reaching the same distortion) while maintaining a similar attack success rate. It can also efficiently fine-tune the image distortion to maintain high visual similarity to the original image. Consequently, AutoZOOM provides a novel and efficient means of assessing the robustness of deployed machine learning models.
Acknowledgements
Shin-Ming Cheng was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 107-2218-E-001-005 and MOST 107-2218-E-011-012. Cho-Jui Hsieh and Huan Zhang acknowledge the support of NSF IIS-1719097, an Intel faculty award, Google Cloud, and NVIDIA.
References
 [Abadi et al.] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning.
 [Athalye, Carlini, and Wagner2018] Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ICML.

 [Baluja and Fischer2018] Baluja, S., and Fischer, I. 2018. Adversarial transformation networks: Learning to generate adversarial examples. AAAI.
 [Biggio and Roli2018] Biggio, B., and Roli, F. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition 84:317–331.
 [Biggio et al.2013] Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 387–402.
 [Brendel, Rauber, and Bethge2018] Brendel, W.; Rauber, J.; and Bethge, M. 2018. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR.

 [Carlini and Wagner2017a] Carlini, N., and Wagner, D. 2017a. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 3–14.
 [Carlini and Wagner2017b] Carlini, N., and Wagner, D. 2017b. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 39–57.
 [Chen et al.2017] Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, 15–26.
 [Chen et al.2018] Chen, P.-Y.; Sharma, Y.; Zhang, H.; Yi, J.; and Hsieh, C.-J. 2018. EAD: Elastic-net attacks to deep neural networks via adversarial examples. AAAI.
 [Duchi et al.2015] Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5):2788–2806.
 [Gao, Jiang, and Zhang2014] Gao, X.; Jiang, B.; and Zhang, S. 2014. On the information-adaptive variants of the ADMM: An iteration complexity perspective. Optimization Online 12.
 [Ghadimi and Lan2013] Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341–2368.
 [Goodfellow, Shlens, and Szegedy2015] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. ICLR.
 [Ilyas et al.2018] Ilyas, A.; Engstrom, L.; Athalye, A.; and Lin, J. 2018. Black-box adversarial attacks with limited queries and information. ICML.
 [Ilyas2018] Ilyas, A. 2018. Circumventing the ensemble adversarial training defense. https://github.com/andrewilyas/ensadvtrainattack.
 [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
 [Krizhevsky2009] Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
 [Kurakin, Goodfellow, and Bengio2017] Kurakin, A.; Goodfellow, I.; and Bengio, S. 2017. Adversarial machine learning at scale. ICLR.
 [Lax and Terrell2014] Lax, P. D., and Terrell, M. S. 2014. Calculus with applications. Springer.
 [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Liu et al.2017] Liu, Y.; Chen, X.; Liu, C.; and Song, D. 2017. Delving into transferable adversarial examples and black-box attacks. ICLR.
 [Liu et al.2018] Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. O. 2018. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. AISTATS.
 [Lowd and Meek2005] Lowd, D., and Meek, C. 2005. Adversarial learning. In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 641–647.
 [Narodytska and Kasiviswanathan2016] Narodytska, N., and Kasiviswanathan, S. P. 2016. Simple black-box adversarial perturbations for deep networks. arXiv preprint arXiv:1612.06299.
 [Nesterov and Spokoiny2017] Nesterov, Y., and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17(2):527–566.
 [Nitin Bhagoji et al.2018] Nitin Bhagoji, A.; He, W.; Li, B.; and Song, D. 2018. Practical black-box attacks on deep neural networks using efficient query mechanisms. In ECCV, 154–169.
 [Papernot et al.2017] Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z. B.; and Swami, A. 2017. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 506–519.
 [Papernot, McDaniel, and Goodfellow2016] Papernot, N.; McDaniel, P.; and Goodfellow, I. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
 [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.
 [Su et al.2018] Su, D.; Zhang, H.; Chen, H.; Yi, J.; Chen, P.; and Gao, Y. 2018. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In ECCV.
 [Suya et al.2017] Suya, F.; Tian, Y.; Evans, D.; and Papotti, P. 2017. Query-limited black-box attacks to classifiers. NIPS Workshop.
 [Szegedy et al.2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. ICLR.
 [Szegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
 [Tramèr et al.2018] Tramèr, F.; Kurakin, A.; Papernot, N.; Boneh, D.; and McDaniel, P. 2018. Ensemble adversarial training: Attacks and defenses. ICLR.
 [Wang et al.2018] Wang, Y.; Du, S.; Balakrishnan, S.; and Singh, A. 2018. Stochastic zeroth-order optimization in high dimensions. AISTATS.
Supplementary Material
Appendix A More Background on Adversarial Attacks and Defenses
The research in generating adversarial examples to deceive machine-learning models, known as adversarial attacks, tends to evolve with the advances in machine-learning techniques and newly available public datasets. In [Lowd and Meek2005], the authors studied adversarial attacks on linear classifiers with continuous or Boolean features. In [Biggio et al.2013], the authors proposed a gradient-based adversarial attack on kernel support vector machines (SVMs). More recently, gradient-based approaches have also been used in adversarial attacks on image classifiers trained by DNNs [Szegedy et al.2014, Goodfellow, Shlens, and Szegedy2015]. Due to space limitations, we focus on related work in adversarial attacks on DNNs; interested readers may refer to the survey paper [Biggio and Roli2018] for more details.
Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. In principle, adversarial attacks can be formulated as an optimization problem of minimizing the adversarial perturbation while ensuring the attack objectives. In image classification, given a natural image, an untargeted attack aims to find a visually similar adversarial image resulting in a different class prediction, while a targeted attack aims to find an adversarial image leading to a specific class prediction. The visual similarity between a pair of adversarial and natural images is often measured by the $L_p$ norm of their difference, where $p \geq 1$. Existing powerful white-box adversarial attacks using the $L_\infty$, $L_2$, or $L_1$ norm include iterative fast gradient sign methods [Kurakin, Goodfellow, and Bengio2017], Carlini and Wagner's (C&W) attack [Carlini and Wagner2017b], elastic-net attacks to DNNs (EAD) [Chen et al.2018], etc.
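As a concrete illustration of the iterative fast gradient sign method cited above, the following numpy sketch performs the standard $L_\infty$ update, assuming access to a white-box gradient oracle; the names `grad_fn`, `alpha`, and `eps` are illustrative choices of ours, not settings from any cited paper.

```python
import numpy as np

def iterative_fgsm(x, grad_fn, alpha=0.01, eps=0.1, steps=10):
    """Iterative fast gradient sign method (untargeted, L-infinity) sketch.

    x       : natural image with pixel values in [0, 1]
    grad_fn : assumed white-box oracle returning d(loss)/dx at a given input
    alpha   : per-step size; eps: L-infinity perturbation budget
    """
    x_adv = x.copy()
    for _ in range(steps):
        # Ascend the loss surface along the sign of the input gradient.
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

The projection step is what distinguishes the iterative variant from the one-shot FGSM: the perturbation can accumulate over steps but never exceeds the $L_\infty$ budget.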
Black-box adversarial attacks are practical threats to deployed machine-learning services. Attackers can observe the input-output correspondences of any queried input, but the target model parameters are completely hidden, so gradient-based adversarial attacks are inapplicable in the black-box setting. As a first attempt, the authors in [Papernot et al.2017] proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and leverage the transferability of adversarial examples [Papernot, McDaniel, and Goodfellow2016, Liu et al.2017] to attack the target model. However, training a representative surrogate for a DNN is challenging due to the complicated and nonlinear classification rules of DNNs and the high dimensionality of the underlying dataset; the performance of black-box attacks can be severely degraded if the adversarial examples crafted for the substitute model transfer poorly to the target model. To bridge this gap, the authors in [Chen et al.2017] proposed a black-box attack called ZOO that directly estimates the gradient of the attack objective by iteratively querying the target model. Although ZOO attains an attack success rate and visual quality comparable to those of many white-box attack methods, it exploits the symmetric difference quotient [Lax and Terrell2014] for coordinate-wise gradient estimation and value update, which requires excessive target-model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in the later work of [Nitin Bhagoji et al.2018]. Although acceleration techniques such as importance sampling, bilinear scaling, and random feature grouping have been used in [Chen et al.2017, Nitin Bhagoji et al.2018], the coordinate-wise gradient estimation approach remains a bottleneck for query efficiency.
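To make the query bottleneck concrete, here is a minimal sketch of coordinate-wise gradient estimation via the symmetric difference quotient, in the style of ZOO; the function and parameter names are ours, not taken from the ZOO implementation. Note that one full gradient estimate costs $2d$ model queries for a $d$-dimensional input.

```python
import numpy as np

def coordinatewise_grad(f, x, h=1e-4):
    """Estimate the gradient of a black-box scalar function f at x using
    the symmetric difference quotient, one coordinate at a time.
    Requires 2 * x.size queries to f -- the bottleneck discussed above."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h                       # perturb a single coordinate
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g
```

For a 299x299x3 ImageNet input this amounts to over 500,000 queries per gradient estimate, which is why ZOO resorts to per-iteration coordinate subsets and the acceleration techniques mentioned above.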
Beyond optimization-based approaches, the authors in [Ilyas et al.2018] proposed using a natural evolution strategy (NES) to enhance query efficiency. Although the NES attack also contains a vector-wise gradient estimation step, we treat it as independent and parallel work, since its natural evolutionary step falls outside the scope of black-box attacks based on zeroth-order gradient descent. We also note that, unlike NES, our AutoZOOM framework uses a query-efficient random gradient estimation strategy. In addition, AutoZOOM could be applied to further improve the query efficiency of NES, since NES does not take into account attack dimension reduction, which is the main focus of this paper. Under a more restricted setting, where only the decision (top-1 predicted class) is revealed to an attacker, the authors in [Brendel, Rauber, and Bethge2018] proposed a random-walk-based attack around the decision boundary. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries.
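For contrast with the coordinate-wise approach, a vector-wise (averaged random) gradient estimator can be sketched as follows. This follows the general shape of the estimator in Eq. (4) of the main text, but the scaling factor `b` and the sampling details here are illustrative assumptions rather than the exact AutoZOOM settings.

```python
import numpy as np

def random_grad_estimate(f, x, q=10, beta=1e-3, b=None, rng=None):
    """Averaged random-vector gradient estimator (cf. Eq. (4) in the text).

    Uses q + 1 queries to f regardless of the dimension of x, in contrast
    to the 2d queries of coordinate-wise estimation.  b is a scaling
    factor; b = x.size is a common choice that makes the estimator
    unbiased for the smoothed objective."""
    rng = np.random.default_rng() if rng is None else rng
    b = x.size if b is None else b
    f0 = f(x)                      # one shared query at the current point
    g = np.zeros_like(x)
    for _ in range(q):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)     # random direction on the unit sphere
        g += (f(x + beta * u) - f0) / beta * u
    return b * g / q
```

Averaging over `q` directions trades a few extra queries for lower estimation variance, which is the balance the adaptive strategy in the main text tunes over the course of the attack.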
In this paper, we focus on improving the query efficiency of black-box attacks based on gradient estimation and gradient descent, and consider the threat model in which the class prediction scores are known to an attacker. For the reader's reference, we compare existing black-box attacks on DNNs with AutoZOOM in Table S1. One unique feature of AutoZOOM is its use of a reduced attack dimension when mounting black-box attacks, an unlabeled-data-driven technique (autoencoder) for attack acceleration that has not been studied thoroughly in existing attacks. While white-box attacks such as [Baluja and Fischer2018] have utilized autoencoders trained on the training data and the transparent logit representations of DNNs, we propose in this work to use autoencoders trained on unlabeled natural data to improve the query efficiency of black-box attacks.
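The bilinear-resizing variant of this idea needs no training at all: the "decoder" is simply an upsampling map from the small attack space to the image space, so gradient estimation runs in the low-dimensional space. Below is a minimal numpy sketch of such a bilinear decoder; the align-corners interpolation scheme is chosen here for simplicity and may differ from the resizing used in the actual implementation.

```python
import numpy as np

def bilinear_decoder(delta_small, out_h, out_w):
    """Map a low-dimensional perturbation (h x w) to the image dimension
    (out_h x out_w) by bilinear resizing (align-corners interpolation)."""
    h, w = delta_small.shape
    # Fractional coordinates in the small grid for each output pixel.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Interpolate horizontally on the two bracketing rows, then vertically.
    top = (1 - wx) * delta_small[np.ix_(y0, x0)] + wx * delta_small[np.ix_(y0, x1)]
    bot = (1 - wx) * delta_small[np.ix_(y1, x0)] + wx * delta_small[np.ix_(y1, x1)]
    return (1 - wy) * top + wy * bot
```

An attacker would optimize `delta_small` (e.g., 32x32 entries) with zeroth-order updates and add `bilinear_decoder(delta_small, 299, 299)` to the natural image per channel, so the query cost scales with the small dimension rather than the image dimension.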
Many methods have been proposed for defending against adversarial attacks on DNNs. However, new defenses are continuously weakened by follow-up attacks [Carlini and Wagner2017a, Athalye, Carlini, and Wagner2018]. For instance, model ensembles [Tramèr et al.2018] were shown to be effective against some black-box attacks, but they were recently circumvented by advanced attack techniques [Ilyas2018]. In this paper, we focus on improving query efficiency in attacking undefended black-box DNNs.
Appendix B Proof of Theorem 1
Recall that the data dimension is and we assume to be differentiable and its gradient to be Lipschitz. Fixing and consider a smoothed version of :
(S1) 
Based on [Gao, Jiang, and Zhang2014, Lemma 4.1a], we have the relation
(S2) 
which then yields
(S3) 
where we recall that has been defined in (3). Moreover, based on [Gao, Jiang, and Zhang2014, Lemma 4.1b], we have
(S4) 
Substituting (S3) into (S4), we obtain
This then implies that
(S5) 
where
Once again, by applying [Gao, Jiang, and Zhang2014, Lemma 4.1b], we can easily obtain that
(S6) 
Now, let us consider the averaged random gradient estimator in (4),
Due to the properties of i.i.d. samples and (S5), we define
(S7) 
Moreover, we have
(S8)  
(S9)  
(S10) 
where we have used the fact that . The definition of in (S7) yields
(S11) 
From (S6), we also obtain that for any ,
(S12) 
Substituting (S11) and (S12) into (S10), we obtain
(S13)  
(S14) 
Finally, we bound the mean squared estimation error as
(S15) 
which completes the proof.
Appendix C Architectures of Convolutional Autoencoders in AutoZOOM
On MNIST, the convolutional autoencoder (CAE) is trained on 50,000 randomly selected handwritten digits from the MNIST8M dataset (http://leon.bottou.org/projects/infimnist). On CIFAR-10, the CAE is trained on 9,900 images selected from its test dataset; the remaining images are used in the black-box attacks. On ImageNet, all the attacked natural images are drawn from 10 randomly selected image labels, and these labels are also used as the candidate attack targets. The CAE is trained on about 9,000 images from these classes.
Table S2 shows the architectures of all the autoencoders used in this work. Note that the autoencoders designed for ImageNet use bilinear scaling to transform the data size from to , and back from to . This allows easy processing and handling in the autoencoder's internal convolutional layers.
The normalized mean squared error of our autoencoders trained on MNIST, CIFAR-10, and ImageNet is 0.0027, 0.0049, and 0.0151, respectively, which lies within a reasonable range of compression loss.
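For reference, the normalized mean squared error quoted above can be computed as in the sketch below, assuming the standard definition (reconstruction error normalized by signal energy, averaged over a batch); if the paper used a different normalization, the constants would differ accordingly.

```python
import numpy as np

def normalized_mse(x, x_rec):
    """Per-sample ||x_rec - x||^2 / ||x||^2, averaged over the batch.
    x and x_rec have shape (batch, ...); one ratio per sample."""
    axes = tuple(range(1, x.ndim))
    num = np.sum((x_rec - x) ** 2, axis=axes)
    den = np.sum(x ** 2, axis=axes)
    return float(np.mean(num / den))
```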
Appendix D More Adversarial Examples of Attacking Inception-v3 in the Black-box Setting
Figure S1 shows additional adversarial examples crafted against the Inception-v3 model in the black-box targeted attack setting.
Appendix E Performance Evaluation of Black-box Untargeted Attacks
Table S3 shows the attack performance of black-box untargeted attacks on MNIST, CIFAR-10, and ImageNet using the ZOO and AutoZOOM-BiLIN attacks on the same set of images as in Section 4.5. The loss function is defined as
(S16) 
where is the top-1 prediction label of a natural image . We set and use on MNIST and CIFAR-10 and on ImageNet for distortion fine-tuning in the post-attack phase. Compared with Table 3, the number of model queries can be further reduced, since untargeted attacks only require the adversarial images to be classified as any class other than , rather than as a specific class .
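As an illustration of the untargeted loss described above, the following sketch implements the common hinge-style loss on log class scores used in ZOO/C&W-style attacks, which matches the surrounding description; whether it matches Eq. (S16) term for term is an assumption on our part.

```python
import numpy as np

def untargeted_loss(log_probs, t0, kappa=0.0):
    """Hinge-style untargeted attack loss on log class scores (a common
    ZOO/C&W-style form; the exact Eq. (S16) may differ in constants).
    Reaches its floor once some class other than the original label t0
    dominates by margin kappa; positive while t0 still wins."""
    others = np.delete(log_probs, t0)
    return max(log_probs[t0] - np.max(others), -kappa)
```

Minimizing this quantity over the perturbed image drives the model toward predicting any class other than the original label, which is exactly why untargeted attacks tend to need fewer queries than targeted ones.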