1 Introduction
Despite the remarkable success of machine learning approaches, especially deep neural networks (DNNs), in image classification, a series of results Cheng et al. (2018); Moosavi-Dezfooli et al. (2016, 2015); Szegedy et al. (2013); Goodfellow et al. (2014) has raised concerns regarding their utility in safety-critical applications. Attackers can design human-imperceptible perturbations which, when added to almost any natural image, result in its misclassification. This perturbation design problem is usually posed as an optimization problem: given a fixed classifier $f$ and a correctly classified image $x_0$ with label $y_0 = f(x_0)$, find a noise vector $\delta$ of minimal norm such that $f(x_0 + \delta) \neq y_0$.
There are a couple of issues with most of the current approaches developed to solve the above formulation. Most approaches assume a "white-box" setting where the adversary has complete knowledge of the ML model parameters. In such a setting, the gradient of the attack objective function can be computed by backpropagation, and the perturbation design problem can be solved quite easily. However, in many practical applications, model parameters (or the internal design architecture) are inaccessible to the user Bhagoji et al. (2018), and only queries can be made to obtain the corresponding outputs of the model (e.g., probability scores, hard label (or top predicted class), etc.). This gives rise to the security-through-obscurity (STO) defense paradigm. STO is the practice of implementing security within a system by enforcing secrecy and confidentiality of the system's internal design architecture; it aims to secure a system by deliberately hiding or concealing its security flaws Van Oorschot (2003). Interestingly, some recent works have shown that even in such a "black-box" setting, it is possible to fool the ML classifier with high probability Chen et al. (2017b); Papernot et al. (2017); Bhagoji et al. (2017); Liu et al. (2018); Tu et al. (2018). These black-box attacks can be broadly classified into two categories: knowledge-transfer-based attacks and zeroth-order-optimization-based attacks.
In knowledge-transfer-based attacks, instead of attacking the original model $f$, attackers try to construct a substitute model $\hat{f}$ to mimic $f$ and then attack $\hat{f}$ using existing white-box attack methods Papernot et al. (2017); Bhagoji et al. (2017). However, it was shown recently that these approaches usually lead to much larger distortion and a low success rate of attack transfer Chen et al. (2017b). To overcome this limitation, zeroth-order-optimization-based attacks were devised, which can be directly applied to minimize a suitable loss function for $f$ using derivative-free (black-box) optimization methods Chen et al. (2017b); Liu et al. (2018). In particular, Chen et al. (2017b) considered the score-based black-box setting, where attackers can query the softmax layer output in addition to the final classification result. In this case, it is possible to reconstruct the original loss function and use a zeroth-order optimization approach to optimize it. Most relevant to our work, the authors in
Brendel et al. (2017); Cheng et al. (2018) considered a hard-label black-box setting, which refers to cases where real-world ML systems only provide limited prediction results for an input query. Specifically, only the final decision (top predicted label), instead of the probability outputs, is known to the attacker.
However, compared to white-box attacks, all the black-box attacks discussed above are very computationally expensive (they require millions of queries). Furthermore, these approaches are intrinsically dependent on the individual images: the perturbations are specifically crafted for each image independently. As a result, computing an adversarial perturbation for a new image requires solving an image-dependent optimization problem from scratch, making these attacks infeasible in practice. (There are some efforts on designing image-agnostic universal perturbations; however, they are limited to the "white-box" setting Moosavi-Dezfooli et al. (2016); Poursaeed et al. (2017); Perolat et al. (2018); Brown et al. (2017); Mopuri et al. (2018).) This practical limitation gives the impression that the security-through-obscurity (STO) defense may still be a viable solution. To evaluate this hypothesis, this paper considers the problem of finding a universal adversarial perturbation, a vector $\delta$ which can be added to any image to fool the classifier with high probability, in a hard-label black-box setting. Such a $\delta$ would eliminate the need to recompute a new perturbation for each input. Note that the hard-label black-box setting is very challenging, as it requires minimizing a non-continuous step function, which is combinatorial in nature and cannot be solved by a gradient-based optimizer. The main contributions of this paper are as follows:

We show the existence of universal adversarial perturbations for ML classifiers in a hard-label black-box setting (breaking the STO defense).

We reformulate the attack as an easy-to-solve continuous-valued optimization problem and propose a zeroth-order optimization algorithm for finding such perturbations.

Experimental validations are performed on the CIFAR-10 dataset.
2 Universal Hard-Label Black-Box Attacks
We formalize in this section the notion of universal perturbations in a hard-label black-box setting and propose a method for finding such perturbations. For simplicity, we consider attacking a $K$-way multi-class classification model in this paper, i.e., $f: \mathbb{R}^d \to \{1, \dots, K\}$. Finding a universal adversarial perturbation can be posed as a stochastic optimization problem: find $\delta$ of minimal norm such that
$$\mathbb{P}_{x \sim \mu}\left[ f(x + \delta) \neq y(x) \right] \geq 1 - \epsilon \qquad (1)$$

where $\mu$ is the distribution of natural images, $y(x)$ is the true label of $x$, and $1 - \epsilon$ is the target fooling probability. The main focus of this paper is to design approaches for finding an image-agnostic, minimal-norm (or quasi-imperceptible) perturbation vector $\delta$ that fools the classifier $f$ on almost all images sampled from $\mu$ with only hard-label queries to $f$. The main challenge now is to solve (1) with only hard-label queries to $f$.
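The fooling probability on the left-hand side of (1) can be estimated empirically by querying the hard-label black-box on a batch of sample images. A minimal sketch (the function and the toy classifier are our own illustrative choices, not from the paper):

```python
import numpy as np

def fooling_rate(classify, images, delta):
    """Empirical estimate of the fooling probability in (1): the fraction of
    images whose hard label changes when delta is added."""
    flipped = sum(classify(x + delta) != classify(x) for x in images)
    return flipped / len(images)

# Toy hard-label black-box: a 1-D threshold classifier (for illustration only).
classify = lambda x: int(x.sum() > 0)
images = [np.array([1.0]), np.array([2.0]), np.array([-3.0])]
delta = np.array([-1.5])
print(fooling_rate(classify, images, delta))  # only the first image flips
```

A candidate $\delta$ satisfies (1) at level $1 - \epsilon$ if this estimate is at least $1 - \epsilon$ on a sufficiently large sample.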
Note that the optimization problem (1) is extremely difficult to approach directly. Not only is the gradient with respect to $\delta$ unavailable, but the loss function (the indicator function of the set of $\delta$ satisfying (1)) is also discontinuous. Furthermore, the loss function cannot be evaluated exactly and can only be estimated by empirical sampling. To overcome this challenge, we next introduce an auxiliary function which is much easier to optimize, and we show that to optimize (1), it suffices to optimize the auxiliary function.
2.1 A Universal Auxiliary Function
We now reformulate the universal hard-label black-box attack as an easier-to-optimize problem by defining a universal auxiliary objective function. Later, we will discuss how to evaluate this function's value using hard-label queries, and then apply a zeroth-order optimization algorithm to obtain a universal adversarial perturbation.
Let us consider the set of natural images $\mathcal{X}$ in the ambient space $\mathbb{R}^d$, where each $x \in \mathcal{X}$ has true label $y(x)$, and the hard-label black-box function $f: \mathbb{R}^d \to \{1, \dots, K\}$. Following a similar approach as given in Cheng et al. (2018), we define our new objective function (note that, under the manifold hypothesis, $g$ is continuous almost everywhere and is therefore amenable to zeroth-order optimization methods) as:

$$g(\theta) = \min_{\lambda > 0} \lambda \quad \text{s.t.} \quad f\!\left(x + \lambda \frac{\theta}{\|\theta\|}\right) \neq y(x) \ \text{ for all } x \in \mathcal{X} \qquad (2)$$
In this formulation, $\theta$ represents the universal search direction, and $g(\theta)$ is the distance from the clean images to the nearest universal adversarial perturbation along the direction $\theta$. Instead of directly searching for a universal perturbation $\delta$, we search for the universal direction $\theta$ that minimizes the distortion $g(\theta)$, which leads to the following optimization problem:
$$\min_{\theta} \ g(\theta) \qquad (3)$$
and in this case, with $\theta^*$ denoting a solution of (3), the universal perturbation vector and the adversarial examples are given by

$$\delta^* = g(\theta^*) \frac{\theta^*}{\|\theta^*\|}, \qquad x_{\mathrm{adv}} = x + \delta^* \ \text{ for } x \in \mathcal{X}. \qquad (4)$$
Note that unlike the white-box objective functions, which become discontinuous step functions in the hard-label setting, $g(\theta)$ maps an input direction to a real-valued output (the distance to the decision boundary), which is usually a continuous function. This makes it easy to apply zeroth-order optimization methods to the universal perturbation design problem. In the absence of knowledge of the image-generating distribution, the most natural choice is to estimate $g$ empirically as follows: take a sample set of $n$ images $\{(x_i, y_i)\}_{i=1}^{n}$ and define
$$\hat{g}(\theta) = \max_{1 \leq i \leq n} g_i(\theta), \quad \text{where } g_i(\theta) = \min_{\lambda > 0} \lambda \ \text{ s.t. } \ f\!\left(x_i + \lambda \frac{\theta}{\|\theta\|}\right) \neq y_i. \qquad (5)$$
In other words, $\hat{g}(\theta)$ is the minimal perturbation distance in direction $\theta$ such that all images $x_1, \dots, x_n$ are misclassified.
2.2 Algorithms to Find Universal Perturbation
Even with our new auxiliary function defined, we cannot evaluate the gradient of $\hat{g}$ due to the black-box nature of the problem. However, we can now evaluate the function values of $\hat{g}$ (which are continuous) using hard-label queries to the original classifier $f$. This procedure is used to find the initial direction $\theta_0$ and the corresponding $\hat{g}(\theta_0)$ in our optimization algorithm. For a given normalized $\theta$, we perform a fine-grained search followed by a binary search, similar to Cheng et al. (2018). We omit the detailed algorithm for this part since it is similar to Algorithm 1.
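The query-based evaluation described above can be sketched as follows. This is a hedged reconstruction, not the paper's exact Algorithm 1: the function names, tolerances, and the doubling coarse search are our own choices, and a toy 1-D classifier stands in for the black-box.

```python
import numpy as np

def g_single(classify, x, y, theta, t_max=10.0, tol=1e-4):
    """Per-image boundary distance g_i: the smallest radius t such that the
    hard label of x + t * theta/||theta|| differs from the true label y.
    Uses hard-label queries only: a coarse doubling search brackets the
    label flip, then binary search refines it."""
    theta = theta / np.linalg.norm(theta)
    hi, t = None, tol
    while t <= t_max:                 # coarse search for a flipped label
        if classify(x + t * theta) != y:
            hi = t
            break
        t *= 2.0
    if hi is None:
        return np.inf                 # no flip found within the budget
    lo = hi / 2.0 if hi > tol else 0.0
    while hi - lo > tol:              # binary search: flipped at hi, not at lo
        mid = (lo + hi) / 2.0
        if classify(x + mid * theta) != y:
            hi = mid
        else:
            lo = mid
    return hi

def g_hat(classify, images, labels, theta):
    """Empirical universal objective (5): the largest per-image distance,
    i.e. the radius needed to misclassify every sampled image along theta."""
    return max(g_single(classify, x, y, theta) for x, y in zip(images, labels))
```

A fine-grained linear search can replace the doubling step when a tighter initial bracket is needed; Cheng et al. (2018) describe the single-image version in detail.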
In practice, $\hat{g}$ turns out to be difficult to minimize directly. Therefore, in our implementation, we consider a couple of variants of $\hat{g}$, as discussed next. First, we consider a norm-based approximation:
$$\hat{g}_{\mathrm{norm}}(\theta) = \left\| \left( g_1(\theta), \dots, g_n(\theta) \right) \right\|_p \qquad (6)$$
where each $g_i(\theta)$ is obtained by applying $g$ to a single image $x_i$ (instead of the whole set of images $\mathcal{X}$). We call such an attack a "Norm-Attack," as we are trying to minimize the norm of a vector with $g_i(\theta)$ in each component. Next, we consider a stochastic approximation, as given below:
$$\hat{g}_{\mathrm{prob}}(\theta) = \min_{\lambda > 0} \lambda \ \text{ s.t. } \ \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[ f\!\left(x_i + \lambda \frac{\theta}{\|\theta\|}\right) \neq y_i \right] \geq 1 - \epsilon \qquad (7)$$
We refer to this attack as the "Prob-Attack," as it aims to find the minimal perturbation needed to fool a given proportion of the training samples.
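Assuming the per-image distances $g_i(\theta)$ have already been computed, the two surrogates can be sketched as follows. The function names, the choice of $p$, and the reading of the Prob-Attack objective as a quantile of the per-image distances are our own illustrative choices:

```python
import numpy as np

def norm_attack_objective(distances, p=2):
    """Norm-Attack surrogate: the p-norm of the vector (g_1, ..., g_n)
    of per-image boundary distances."""
    return np.linalg.norm(np.asarray(distances, dtype=float), ord=p)

def prob_attack_objective(distances, rho=0.8):
    """Prob-Attack surrogate (one plausible reading): the smallest radius at
    which at least a fraction rho of the sampled images are fooled, i.e. the
    rho-quantile of the per-image boundary distances."""
    d = np.sort(np.asarray(distances, dtype=float))
    k = int(np.ceil(rho * len(d))) - 1
    return d[k]
```

Minimizing the norm surrogate shrinks all per-image distances jointly, while the quantile surrogate ignores the hardest $(1 - \rho)$ fraction of images.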
Given these surrogate objectives, for which we can only evaluate function values instead of gradients, zeroth-order optimization algorithms can be naturally applied to solve the universal adversarial perturbation design problem. In this paper, we use the Randomized Gradient-Free (RGF) method proposed in Balasubramanian and Ghadimi (2018) as our zeroth-order algorithm. In each iteration, the gradient of the objective function $\hat{g}$ is estimated by

$$\hat{\mathbf{g}} = \frac{\hat{g}(\theta + \beta u) - \hat{g}(\theta)}{\beta} \cdot u$$

where $u$ is a random Gaussian vector and $\beta > 0$ is a smoothing parameter.
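One RGF iteration can be sketched as follows. This is a minimal single-probe version under our own naming; the step size, the smoothing parameter, and any normalization of $\theta$ used in practice are assumptions, not the paper's exact settings:

```python
import numpy as np

def rgf_step(obj, theta, beta=0.01, eta=0.05, rng=None):
    """One Randomized Gradient-Free (RGF) iteration: estimate the gradient of
    obj at theta with a single Gaussian probe u via
    (obj(theta + beta*u) - obj(theta)) / beta * u, then take a descent step."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)          # random Gaussian direction
    grad_est = (obj(theta + beta * u) - obj(theta)) / beta * u
    return theta - eta * grad_est
```

Each iteration costs two evaluations of the objective; with the surrogates above, each evaluation in turn costs a batch of hard-label queries, which is where the attack's query budget is spent.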
3 Experimental Results
We test the performance of our universal hard-label black-box attack algorithms on convolutional neural network (CNN) models, and validate our approach on the CIFAR-10 dataset. The network model used is as follows: four convolution layers, two max-pooling layers, and two fully-connected layers. Using the parameters provided by Carlini and Wagner (2017), we could achieve the accuracy reported therein on CIFAR-10. All models are trained using PyTorch, and our source code will be made publicly available soon.
3.1 Breaking the STO Defense
In the first experiment, we analyze the robustness of deep neural network classifiers to black-box universal perturbations found using Algorithm 2 on the CIFAR-10 dataset. Specifically, we report the fooling ratio, that is, the proportion of images that change labels when perturbed by our black-box universal perturbation. It can be seen from Fig. 1 that Algorithm 2 can find quasi-imperceptible perturbations which fool DNN classifiers with very high probability. Specifically, the "Norm-Attack"-based black-box universal perturbation achieves a high fooling/success rate with very small distortion. By increasing the magnitude of the distortion, we can achieve success rates comparable to those of "white-box" attacks. We also show the corresponding perturbed images for visual inspection of quasi-imperceptibility in Fig. 2. These results show that, even without accessing the model parameters, an adversary can fool DNN classifiers by relying only on hard-label queries to the black-box. As a consequence, STO-based defenses are not robust for ML applications.
4 Conclusion
In this paper, we showed the existence of hard-label black-box universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations without accessing the model parameters. In particular, we showed that these universal perturbations can be easily found using hard-label queries to ML black-box models, thereby breaking security-through-obscurity-based defenses. Currently, we are devising techniques to utilize gradient information from white-box models (or knowledge transfer) to reduce the query complexity of finding such hard-label black-box universal perturbations. We also plan to show that these universal perturbations generalize well across different ML models, resulting in doubly-universal perturbations (image-agnostic and network-agnostic). A theoretical analysis of the existence of black-box universal perturbations will be the subject of future research.
5 Acknowledgements
Thanks to the NSF Mathematical Sciences Research Program for financial support for this research. Bhavya Kailkhura's work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-761205-DRAFT). Thomas Hogan would like to express gratitude to Lawrence Livermore National Lab, in particular to Bhavya Kailkhura and Ryan Goldhahn, for their hospitality during his internship, where this research was conducted.
References
 Balasubramanian and Ghadimi [2018] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)convex stochastic optimization via conditional gradient and gradient updates. arXiv preprint arXiv:1809.06474, 2018.
 Bhagoji et al. [2017] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
 Bhagoji et al. [2018] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Black-box attacks on deep neural networks via gradient estimation. ICLR Workshop, 2018.
 Brendel et al. [2017] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
 Brown et al. [2017] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
 Carlini and Wagner [2017] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
 Chen et al. [2017a] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017a.

 Chen et al. [2017b] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017b.
 Cheng et al. [2018] M. Cheng, T. Le, P.-Y. Chen, J. Yi, H. Zhang, and C.-J. Hsieh. Query-efficient hard-label black-box attack: An optimization-based approach. arXiv e-prints, July 2018.
 Goodfellow et al. [2014] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv e-prints, December 2014.
 Kurakin et al. [2016] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 Liu et al. [2018] Sijia Liu, Bhavya Kailkhura, PinYu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zerothorder stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1805.10367, 2018.
 Moosavi-Dezfooli et al. [2015] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. arXiv e-prints, November 2015.
 Moosavi-Dezfooli et al. [2016] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. arXiv e-prints, October 2016.
 Mopuri et al. [2018] Konda Reddy Mopuri, Aditya Ganeshan, and R Venkatesh Babu. Generalizable data-free objective for crafting universal adversarial perturbations. arXiv preprint arXiv:1801.08092, 2018.
 Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
 Perolat et al. [2018] Julien Perolat, Mateusz Malinowski, Bilal Piot, and Olivier Pietquin. Playing the game of universal adversarial perturbations. arXiv preprint arXiv:1809.07802, 2018.
 Poursaeed et al. [2017] Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. arXiv preprint arXiv:1712.02328, 2017.
 Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tu et al. [2018] Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. arXiv preprint arXiv:1805.11770, 2018.
 Van Oorschot [2003] Paul C Van Oorschot. Revisiting software protection. In International Conference on Information Security, pages 1–13. Springer, 2003.