Universal Decision-Based Black-Box Perturbations: Breaking Security-Through-Obscurity Defenses

11/09/2018 ∙ by Thomas A. Hogan, et al. ∙ Lawrence Livermore National Laboratory, University of California-Davis

We study the problem of finding a universal (image-agnostic) perturbation to fool machine learning (ML) classifiers (e.g., neural nets, decision trees) in the hard-label black-box setting. Recent work in adversarial ML in the white-box setting (model parameters are known) has shown that many state-of-the-art image classifiers are vulnerable to universal adversarial perturbations: a fixed human-imperceptible perturbation that, when added to any image, causes it to be misclassified with high probability Kurakin et al. [2016], Szegedy et al. [2013], Chen et al. [2017a], Carlini and Wagner [2017]. This paper considers the more practical and challenging problem of finding such universal perturbations in an obscure (or black-box) setting. More specifically, we use zeroth-order optimization algorithms to find such a universal adversarial perturbation when no model information is revealed, except that the attacker can make queries to probe the classifier. We further relax the assumption that the output of a query is a continuous-valued confidence score for every class and consider the case where the output is a hard-label decision. Surprisingly, we found that even in these extremely obscure regimes, state-of-the-art ML classifiers can be fooled with very high probability just by adding a single human-imperceptible image perturbation to any natural image. The existence of universal perturbations in a hard-label black-box setting raises serious security concerns, since adversaries can potentially exploit a single universal noise vector to break a classifier on most natural images.




1 Introduction

Despite the remarkable success of machine learning approaches, especially deep neural networks (DNNs), in image classification, a series of results Cheng et al. (2018); Moosavi-Dezfooli et al. (2016, 2015); Szegedy et al. (2013); Goodfellow et al. (2014) has raised concerns regarding their utility in safety-critical applications. Attackers can design human-imperceptible perturbations which, when added to almost any natural image, cause it to be misclassified. This perturbation design problem is usually posed as an optimization problem: given a fixed classifier $f$ and a correctly classified image $x$, find a noise vector $\epsilon$ of minimal norm such that $f(x + \epsilon) \neq f(x)$.

There are a couple of issues with most of the current approaches developed to solve the above formulation. Most approaches assume a “white-box” setting where the adversary has complete knowledge of the ML model parameters. In such a setting, the gradient of the attack objective function can be computed by back-propagation, and the perturbation design problem can be solved quite easily. However, in many practical applications, model parameters (or the internal design architecture) are inaccessible to the user Bhagoji et al. (2018), and only queries can be made to obtain the corresponding outputs of the model (e.g., probability scores or a hard label, i.e., the top-1 predicted class). This gives rise to the security through obscurity (STO) defense paradigm. STO is a process of implementing security within a system by enforcing secrecy and confidentiality of the system’s internal design architecture. STO aims to secure a system by deliberately hiding or concealing its security flaws Van Oorschot (2003). Interestingly, some recent works have shown that even in such a “black-box” setting, it is possible to fool the ML classifier with high probability Chen et al. (2017b); Papernot et al. (2017); Bhagoji et al. (2017); Liu et al. (2018); Tu et al. (2018).

These black-box attacks can be broadly classified into two categories: knowledge-transfer-based attacks and zeroth-order-optimization-based attacks. In knowledge-transfer-based attacks, instead of attacking the original model $f$, attackers construct a substitute model $\hat{f}$ to mimic $f$ and then attack $\hat{f}$ using existing white-box attack methods Papernot et al. (2017); Bhagoji et al. (2017). However, it was shown recently that these approaches usually lead to much larger distortion and a low success rate of attack transfer Chen et al. (2017b). To overcome this limitation, zeroth-order-optimization-based attacks were devised, which directly minimize a suitable loss function for $f$ using derivative-free (black-box) optimization methods Chen et al. (2017b); Liu et al. (2018). In particular, Chen et al. (2017b) considered the score-based black-box setting, where attackers can query the softmax layer output in addition to the final classification result. In this case, it is possible to reconstruct the original loss function and use a zeroth-order optimization approach to optimize it. Most relevant to our work, the authors in Brendel et al. (2017); Cheng et al. (2018) considered a hard-label black-box setting, which refers to cases where real-world ML systems provide only limited prediction results for an input query. Specifically, only the final decision (the top-1 predicted label), rather than the probability outputs, is known to the attacker.
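To make the difference between the score-based and hard-label settings concrete, the following minimal sketch wraps a score-returning model so that the attacker sees only the top-1 label. The names `HardLabelOracle` and `toy_scores` are hypothetical and are not taken from the paper's implementation; the query counter is included only because query cost is the main expense in this setting.

```python
class HardLabelOracle:
    """Hypothetical wrapper: exposes only the top-1 label of a score model."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # hidden from the attacker
        self.queries = 0           # number of queries issued so far

    def __call__(self, x):
        self.queries += 1
        scores = self._score_fn(x)
        # Return the index of the largest score and discard the scores.
        return max(range(len(scores)), key=lambda k: scores[k])


def toy_scores(x):
    """Toy 3-class score function standing in for a softmax output."""
    return [x[0], x[1], -(x[0] + x[1])]


oracle = HardLabelOracle(toy_scores)
print(oracle([0.2, 0.9]), oracle([-1.0, -2.0]), oracle.queries)
```

A score-based attacker would receive the full list returned by `toy_scores`; the hard-label attacker receives a single integer per query.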

However, compared to white-box attacks, all the black-box attacks discussed above are computationally very expensive (they require millions of queries). Furthermore, these approaches are intrinsically dependent on the individual images: the perturbations are specifically crafted for each image independently. As a result, computing an adversarial perturbation for a new image requires solving an image-dependent optimization problem from scratch, which makes these attacks infeasible in practice. (There are some efforts on designing image-agnostic universal perturbations; however, they are limited to the “white-box” setting Moosavi-Dezfooli et al. (2016); Poursaeed et al. (2017); Perolat et al. (2018); Brown et al. (2017); Mopuri et al. (2018).) This practical limitation gives the impression that a security through obscurity (STO) defense may still be a viable solution. To evaluate this hypothesis, this paper considers the problem of finding a Universal Adversarial Perturbation, a vector $\epsilon$ which can be added to any image to fool the classifier with high probability, in a hard-label black-box setting. Such an $\epsilon$ would eliminate the need to recompute a new perturbation for each input. Note that the hard-label black-box setting is very challenging, as it requires minimizing a discontinuous step function, which is a combinatorial problem that cannot be solved by a gradient-based optimizer. The main contributions of this paper are as follows:

  • We show the existence of universal adversarial perturbations for ML classifiers in a hard-label black-box setting (breaking the STO defense).

  • We reformulate the attack as an easy to solve continuous-valued optimization problem and propose a zeroth-order optimization algorithm for finding such perturbations.

  • Experimental validation is performed on the CIFAR-10 dataset.

2 Universal Hard-Label Black-Box Attacks

We formalize in this section the notion of universal perturbations in a hard-label black-box setting, and propose a method for finding such perturbations. For simplicity, we consider attacking a $K$-way multi-class classification model, i.e., $f: \mathbb{R}^d \to \{1, \dots, K\}$. Finding a universal adversarial perturbation can be posed as a stochastic optimization problem: find $\epsilon$ of minimal norm such that

$$f(x + \epsilon) \neq f(x) \quad \text{for almost all } x \in \mathcal{X}, \qquad (1)$$

where $\mathcal{X}$ is the set of natural images. The main focus of this paper is to design approaches to find an image-agnostic minimal-norm (or quasi-imperceptible) perturbation vector $\epsilon$ that fools the classifier $f$ on almost all images sampled from $\mathcal{X}$. The main challenge is to solve (1) with only hard-label queries to $f$.
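Whether a candidate perturbation satisfies condition (1) can only be checked empirically, by querying the classifier on sampled images. The sketch below estimates that fooling rate with hard-label queries alone. The classifier `toy_classifier` and the synthetic two-dimensional "images" are assumptions made purely for illustration; they are not the paper's model or data.

```python
import random


def toy_classifier(x):
    """Hypothetical hard-label oracle: returns only a class index."""
    return 1 if x[0] + x[1] > 0 else 0


def fooling_rate(f, images, eps):
    """Fraction of images whose hard label changes when eps is added."""
    flipped = 0
    for x in images:
        x_adv = [xi + ei for xi, ei in zip(x, eps)]
        if f(x_adv) != f(x):
            flipped += 1
    return flipped / len(images)


rng = random.Random(0)
# Toy "natural images": points clustered tightly around (1, 1).
images = [[1 + rng.gauss(0, 0.1), 1 + rng.gauss(0, 0.1)] for _ in range(200)]
print(fooling_rate(toy_classifier, images, [0.0, 0.0]))    # no perturbation
print(fooling_rate(toy_classifier, images, [-2.0, -2.0]))  # one shared shift
```

The same vector `eps` is added to every image, which is what makes the perturbation universal rather than image-specific.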

Note that the optimization problem (1) is extremely difficult to approach directly. Not only is the gradient with respect to $\epsilon$ unavailable, but the loss function (the indicator function of the set of $\epsilon$ satisfying equation (1)) is also discontinuous. Furthermore, the loss function cannot be evaluated directly and can only be estimated by empirical sampling. To overcome this challenge, we next introduce an auxiliary function which is much easier to optimize. Furthermore, we show that to optimize (1), it suffices to optimize the auxiliary function.

2.1 A Universal Auxiliary Function

We now reformulate the universal hard-label black-box attack as an easier optimization problem by defining a universal auxiliary objective function. We then discuss how to evaluate this function using hard-label queries, and apply a zeroth-order optimization algorithm to obtain a universal adversarial perturbation.

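Let us consider the set of natural images $\mathcal{X}$ in the ambient space $\mathbb{R}^d$, where each $x \in \mathcal{X}$ has true label $y$, and the hard-label black-box function $f: \mathbb{R}^d \to \{1, \dots, K\}$. Following a similar approach as given in Cheng et al. (2018), we define our new objective function $g$ as:

$$g(\theta) = \min \left\{ \lambda > 0 \;:\; f\!\left(x + \lambda \frac{\theta}{\|\theta\|}\right) \neq y \ \text{ for almost all } x \in \mathcal{X} \right\}.$$

(Note that under the manifold hypothesis $g$ is continuous almost everywhere, and is therefore amenable to zeroth-order optimization methods.)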
In this formulation, $\theta$ represents the universal search direction and $g(\theta)$ is the distance to the nearest universal adversarial perturbation along the direction $\theta$. Instead of directly searching for a universal perturbation $\epsilon$, we search for the universal direction $\theta$ that minimizes the distortion $g(\theta)$, which leads to the following optimization problem:

$$\min_{\theta} \; g(\theta),$$

and in this case the universal perturbation vector and adversarial examples are given by

$$\epsilon^* = g(\theta^*) \frac{\theta^*}{\|\theta^*\|}, \qquad x^* = x + \epsilon^*,$$

where $\theta^*$ denotes a minimizer of $g$.
Note that unlike white-box objective functions, which are discontinuous step functions in the hard-label setting, $g$ maps an input direction to a real-valued output (the distance to the decision boundary) and is usually continuous. This makes it easy to apply zeroth-order optimization methods to the universal perturbation design problem. In the absence of knowledge of the image-generating distribution, the most natural choice is to estimate $g$ empirically as follows: take a sample set of images $\{(x_i, y_i)\}_{i=1}^{n}$ and define

$$\hat{g}(\theta) = \min \left\{ \lambda > 0 \;:\; f\!\left(x_i + \lambda \frac{\theta}{\|\theta\|}\right) \neq y_i \ \text{ for all } i = 1, \dots, n \right\}.$$

In other words, $\hat{g}(\theta)$ is defined as the minimal perturbation distance in direction $\theta$ so that all images $x_1, \dots, x_n$ are misclassified.
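As an illustration only (this special case is not analyzed in the paper), consider a binary linear classifier $f(x) = \mathrm{sign}(w^\top x + b)$ with every sampled image on the positive side. Writing $\hat{g}$ for the empirical objective, it then has a closed form, which suggests why the objective is continuous in the direction and where its minimum lies:

```latex
% Assumption: f(x) = sign(w^T x + b) and w^T x_i + b > 0 for every sample i.
% Write \hat\theta = \theta / \|\theta\| and assume w^T \hat\theta < 0.
\begin{align*}
w^\top (x_i + \lambda \hat\theta) + b = 0
  \;&\Longrightarrow\; \lambda_i(\theta) = \frac{w^\top x_i + b}{-\,w^\top \hat\theta}, \\
\hat g(\theta) = \max_{i} \lambda_i(\theta)
  \;&=\; \frac{\max_i \,(w^\top x_i + b)}{-\,w^\top \hat\theta}, \\
\min_\theta \hat g(\theta)
  \;&=\; \frac{\max_i \,(w^\top x_i + b)}{\|w\|}
  \quad \text{attained at } \hat\theta = -\,w / \|w\|.
\end{align*}
```

In this toy case the best universal direction is the normal to the decision boundary, and the required distance is set by the sample farthest from the boundary.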

2.2 Algorithms to Find Universal Perturbation

Even with our new auxiliary function $g$ defined, we cannot evaluate its gradient due to the black-box nature of the problem. However, we can now evaluate the function values of $g$ (which are continuous) using hard-label queries to the original classifier function $f$. This procedure is used to find the initial $\theta_0$ and the corresponding $g(\theta_0)$ in our optimization algorithm. For a given normalized $\theta$, we do a fine-grained search and then a binary search similar to Cheng et al. (2018). We omit the detailed algorithm for this part since it is similar to Algorithm 1.

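  Input: Hard-label model $f$, original image label pairs $\{(x_i, y_i)\}_{i=1}^{n}$, query direction $\theta$, previous value $v$, increase/decrease ratio $\alpha$, stopping tolerance $\tau$ (maximum tolerance of computed error)
  $\theta \leftarrow \theta / \|\theta\|$
  if $f(x_i + v\theta) = y_i$ for some $i$ then
     $v_{left} \leftarrow v$, $v_{right} \leftarrow (1+\alpha) v$
     while $f(x_i + v_{right}\theta) = y_i$ for some $i$ do
        $v_{right} \leftarrow (1+\alpha) v_{right}$
     end while
  else
     $v_{right} \leftarrow v$, $v_{left} \leftarrow (1-\alpha) v$
     while $f(x_i + v_{left}\theta) \neq y_i$ for all $i$ do
        $v_{left} \leftarrow (1-\alpha) v_{left}$
     end while
  end if
  ## Binary search within $[v_{left}, v_{right}]$
  while $v_{right} - v_{left} > \tau$ do
     $v_{mid} \leftarrow (v_{right} + v_{left})/2$
     if $f(x_i + v_{mid}\theta) = y_i$ for some $i$ then
        $v_{left} \leftarrow v_{mid}$
     else
        $v_{right} \leftarrow v_{mid}$
     end if
  end while
  return $v_{right}$
Algorithm 1 Compute $g(\theta)$ locally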
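
The local search of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch rather than the authors' code: it assumes that once all images are fooled at some distance they stay fooled at larger distances along the same direction, and the helper names (`shift`, `all_fooled`, `local_g`), the `max_steps` guard, and the toy classifier are hypothetical additions.

```python
def shift(x, theta, lam):
    """Return x + lam * theta / ||theta|| for list-valued vectors."""
    norm = sum(t * t for t in theta) ** 0.5
    return [xi + lam * ti / norm for xi, ti in zip(x, theta)]


def all_fooled(f, pairs, theta, lam):
    """True when every (x, y) pair is misclassified at distance lam."""
    return all(f(shift(x, theta, lam)) != y for x, y in pairs)


def local_g(f, pairs, theta, v, alpha=0.01, tol=1e-4, max_steps=2000):
    """Estimate the smallest distance along theta that fools all pairs."""
    if not all_fooled(f, pairs, theta, v):
        # Some image is still classified correctly: grow the upper bound.
        lo, hi = v, (1 + alpha) * v
        for _ in range(max_steps):
            if all_fooled(f, pairs, theta, hi):
                break
            lo, hi = hi, (1 + alpha) * hi
        else:
            return float("inf")  # no universal boundary found this way
    else:
        # All images are already fooled: shrink the lower bound.
        lo, hi = (1 - alpha) * v, v
        for _ in range(max_steps):
            if not all_fooled(f, pairs, theta, lo):
                break
            lo, hi = (1 - alpha) * lo, lo
        else:
            return hi  # still fooled at a vanishing distance
    # Binary search in [lo, hi]: hi fools every image, lo does not.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if all_fooled(f, pairs, theta, mid):
            hi = mid
        else:
            lo = mid
    return hi


def toy_f(x):
    """Hypothetical hard-label classifier with boundary at x[0] = 0."""
    return 1 if x[0] > 0 else 0


pairs = [([1.0, 0.0], 1), ([2.0, 0.0], 1)]
print(local_g(toy_f, pairs, [-1.0, 0.0], v=1.0))  # close to 2.0
```

The result is governed by the image farthest from the boundary (here the one at distance 2), since every image must be misclassified simultaneously.
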
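  Input: Hard-label model $f$, original image label pairs $\{(x_i, y_i)\}_{i=1}^{n}$, initial $\theta_0$.
  for $t = 0, 1, \dots, T$ do
     Randomly choose $u_t$ from a zero-mean Gaussian distribution
     Evaluate $g(\theta_t)$ and $g(\theta_t + \beta u_t)$ using Algorithm 1
     Compute $\hat{\nabla} g(\theta_t) = \frac{g(\theta_t + \beta u_t) - g(\theta_t)}{\beta} u_t$
     Update $\theta_{t+1} = \theta_t - \eta_t \hat{\nabla} g(\theta_t)$
  end for
Algorithm 2 RGF for universal hard-label black-box attack (See also Cheng et al. (2018))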
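
The overall loop of Algorithm 2 can be sketched as below on a toy problem. Everything specific here is an assumption made for illustration: the linear classifier `toy_f`, the plain bisection used in `g_hat` in place of Algorithm 1, the fixed step size `eta`, the smoothing parameter `beta`, and the bookkeeping of the best direction seen so far.

```python
import random


def toy_f(x):
    """Hypothetical linear hard-label classifier with hidden weights (1, 2)."""
    return 1 if x[0] + 2.0 * x[1] > 0 else 0


def g_hat(f, pairs, theta, cap=100.0, tol=1e-6):
    """Smallest distance along theta/||theta|| that fools every pair."""
    norm = sum(t * t for t in theta) ** 0.5

    def fooled(lam):
        return all(
            f([xi + lam * ti / norm for xi, ti in zip(x, theta)]) != y
            for x, y in pairs
        )

    if not fooled(cap):
        return cap  # direction is not adversarial within the search range
    lo, hi = 0.0, cap
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if fooled(mid) else (mid, hi)
    return hi


def rgf_attack(f, pairs, theta0, steps=200, beta=0.005, eta=0.01, seed=0):
    """RGF descent on g_hat; returns the best direction and its distance."""
    rng = random.Random(seed)
    theta = list(theta0)
    best_theta, best_g = list(theta), g_hat(f, pairs, theta)
    for _ in range(steps):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        g0 = g_hat(f, pairs, theta)
        g1 = g_hat(f, pairs, [t + beta * ui for t, ui in zip(theta, u)])
        grad = [(g1 - g0) / beta * ui for ui in u]  # one-sample RGF estimate
        theta = [t - eta * gi for t, gi in zip(theta, grad)]
        g_new = g_hat(f, pairs, theta)
        if g_new < best_g:
            best_theta, best_g = list(theta), g_new
    return best_theta, best_g


pairs = [([1.0, 1.0], 1), ([0.5, 1.5], 1), ([1.5, 0.5], 1)]
start = [-1.0, 0.0]
print(g_hat(toy_f, pairs, start), rgf_attack(toy_f, pairs, start)[1])
```

For this toy classifier the smallest achievable distance is 3.5 / sqrt(5), reached along the direction (-1, -2), so the second printed value can be compared against that bound.
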

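In practice, $\hat{g}$ turns out to be difficult to minimize. Therefore, in our implementation, we consider a couple of variants of $\hat{g}$, as discussed next. First, we consider a norm-based approximation:

$$\hat{g}_{\text{norm}}(\theta) = \left\| \big( g_1(\theta), \dots, g_n(\theta) \big) \right\|,$$

where $g_i(\theta)$ is obtained by applying $g$ on a single image $x_i$ (instead of a set of images $\{x_i\}_{i=1}^{n}$). We call such an attack a “NormAttack”, as we are trying to minimize the norm of a vector with $g_i(\theta)$ in each component. Next, we consider a stochastic approximation as given below:

$$\hat{g}_{\text{prob}}(\theta) = \min \left\{ \lambda > 0 \;:\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[ f\!\left(x_i + \lambda \frac{\theta}{\|\theta\|}\right) \neq y_i \right] \geq p \right\}.$$

We refer to this attack as the “ProbAttack”, as it aims to find the minimal perturbation needed to fool a proportion $p$ of the training samples.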
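Given the per-image boundary distances along one direction, the variants can be compared side by side. The sketch below is illustrative only: the text does not say which norm NormAttack uses, so the Euclidean norm is an assumption, and the ProbAttack value is computed as an order statistic under the assumption that an image stays fooled at every distance beyond its own boundary distance. All function names are hypothetical.

```python
import math


def worst_case_objective(dists):
    """Distance needed to fool every image: the largest per-image distance."""
    return max(dists)


def norm_objective(dists):
    """NormAttack-style objective (Euclidean norm is an assumption here)."""
    return math.sqrt(sum(d * d for d in dists))


def prob_objective(dists, p):
    """ProbAttack-style objective: distance that fools a fraction p of images."""
    k = max(1, math.ceil(p * len(dists)))  # number of images that must flip
    return sorted(dists)[k - 1]


# Hypothetical per-image distances to the boundary along one direction.
dists = [0.5, 2.0, 1.0, 4.0]
print(worst_case_objective(dists))   # 4.0
print(prob_objective(dists, 0.75))   # 2.0
print(norm_objective(dists))
```

The example shows why the variants are easier to work with: a single hard image (distance 4.0) dictates the worst-case objective, whereas the proportion-based value ignores it and the norm-based value changes smoothly with every component.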

For the NormAttack and ProbAttack objectives we can only evaluate function values rather than gradients, so zeroth-order optimization algorithms can be naturally applied to solve the universal adversarial perturbation design problem. In this paper, we use the Randomized Gradient-Free (RGF) method proposed in Balasubramanian and Ghadimi (2018) as our zeroth-order algorithm. In each iteration, the gradient of the objective function $g$ is estimated by

$$\hat{\nabla} g(\theta) = \frac{g(\theta + \beta u) - g(\theta)}{\beta} \, u,$$

where $u$ is a random Gaussian vector, and $\beta > 0$ is a smoothing parameter.
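A short sanity check indicates why this one-sample estimate points along the gradient on average. It writes $u$ for the Gaussian vector and $\beta$ for the smoothing parameter, and it assumes that $g$ is differentiable at $\theta$ and that $u \sim \mathcal{N}(0, I)$; neither assumption is stated explicitly in the text.

```latex
\begin{align*}
\frac{g(\theta + \beta u) - g(\theta)}{\beta}\, u
  &\approx \big(\nabla g(\theta)^\top u\big)\, u
   = u u^\top \nabla g(\theta)
   && \text{(first-order Taylor expansion, small } \beta\text{)} \\
\mathbb{E}_{u}\!\left[ u u^\top \nabla g(\theta) \right]
  &= \mathbb{E}_{u}\!\left[ u u^\top \right] \nabla g(\theta)
   = \nabla g(\theta)
   && \text{(since } \mathbb{E}[u u^\top] = I \text{)}
\end{align*}
```

Each estimate costs only two evaluations of $g$, and each evaluation is itself a sequence of hard-label queries.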

3 Experimental Results

We test the performance of our universal hard-label black-box attack algorithms on convolutional neural network (CNN) models and validate our approach on the CIFAR-10 dataset. The network consists of four convolution layers, two max-pooling layers and two fully-connected layers. Using the parameters provided by Carlini and Wagner (2017), we could achieve accuracy on CIFAR-10. All models are trained using PyTorch and our source code will be publicly available soon.

3.1 Breaking the STO Defense

Figure 1: Success rate of universal adversarial perturbations at various scales on random images from the CIFAR-10 dataset. The relative distortion is the norm of the perturbation divided by the average norm of a CIFAR-10 image. Since CIFAR-10 has only 10 image classes, a success rate of 90% is optimal for an untargeted attack, as it indicates that the classifier is essentially returning a random prediction.
Figure 2: The bottom row shows the perturbation scaled to various relative distortion levels. The other rows show the perturbation applied to various CIFAR-10 images, together with the predicted class.

In the first experiment, we analyze the robustness of deep neural network classifiers to black-box universal perturbations found using Algorithm 2 on the CIFAR-10 dataset. Specifically, we report the fooling ratio, that is, the proportion of images that change labels when perturbed by our black-box universal perturbation. It can be seen from Fig. 1 that Algorithm 2 can find quasi-imperceptible perturbations which fool DNN classifiers with very high probability. Specifically, the “NormAttack”-based black-box universal perturbation achieves a high fooling (success) rate with very small distortion. By increasing the magnitude of the distortion, we can achieve a success rate matching that of “white-box” attacks. We also show the corresponding perturbed images in Fig. 2 for visual inspection of quasi-imperceptibility. These results show that even without access to the model parameters, an adversary can fool DNN classifiers by relying only on hard-label queries to the black box. As a consequence, STO-based defenses are not robust for ML applications.
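The relative distortion used on the horizontal axis of Fig. 1 (perturbation norm divided by average image norm) and the rescaling of one perturbation to several distortion levels can be sketched as follows. The choice of the Euclidean norm and all function names are assumptions for illustration; the text does not specify which norm was used.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(c * c for c in v))


def relative_distortion(eps, images):
    """Norm of eps divided by the average norm of the images."""
    avg_norm = sum(l2_norm(x) for x in images) / len(images)
    return l2_norm(eps) / avg_norm


def scale_to_distortion(eps, images, target):
    """Rescale eps so that its relative distortion equals target."""
    factor = target / relative_distortion(eps, images)
    return [factor * c for c in eps]


# Two toy "images" with norms 5 and 10, and a unit-norm perturbation.
images = [[3.0, 4.0], [6.0, 8.0]]
eps = [0.6, 0.8]
print(relative_distortion(eps, images))
print(scale_to_distortion(eps, images, 0.2))
```

Sweeping `target` over a range of values and measuring the fooling ratio at each level would produce a curve of the kind described for Fig. 1.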

4 Conclusion

In this paper, we showed the existence of hard-label black-box universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations without accessing the model parameters. In particular, we showed that these universal perturbations can be found using only hard-label queries to black-box ML models, thereby breaking security-through-obscurity-based defenses. Currently, we are devising techniques that use gradient information from white-box models (or knowledge transfer) to reduce the query complexity of finding such hard-label black-box universal perturbations. We also plan to show that these universal perturbations generalize well across different ML models, resulting in doubly-universal perturbations (image-agnostic and network-agnostic). A theoretical analysis of the existence of black-box universal perturbations will be the subject of future research.

5 Acknowledgements

Thanks to NSF Mathematical Sciences Research Program for financial support for this research. Bhavya Kailkhura’s work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-761205-DRAFT). Thomas Hogan would like to express gratitude to Lawrence Livermore National Lab, in particular to Bhavya Kailkhura and Ryan Goldhahn, for their hospitality during his internship where this research was conducted.