A Study of Black Box Adversarial Attacks in Computer Vision

Machine learning has seen tremendous advances in the past few years, which have led to deep learning models being deployed in varied applications of day-to-day life. Attacks on such models using perturbations, particularly in real-life scenarios, pose a serious challenge to their applicability, pushing research in a direction that aims to enhance the robustness of these models. Since the introduction of these perturbations by Szegedy et al., a significant amount of research has focused on the reliability of such models, primarily in two settings: white-box, where the adversary has access to the targeted model and related parameters; and black-box, which resembles a real-life scenario with the adversary having almost no knowledge of the model to be attacked. We draw attention to the latter scenario and present a comprehensive comparative study of the different adversarial black-box attack approaches proposed to date. The second half of this literature survey focuses on defense techniques. To the best of our knowledge, this is the first study that specifically focuses on the black-box setting, with the aim of motivating future work in this area.





1. Introduction

Machine Learning (ML) is among the most influential fields in science today. With applications in computer science, biomedical engineering, finance, and law, and with specialized branches like Deep Learning (DL) and Reinforcement Learning (RL) advancing language, vision, and audio tasks, the involvement of ML in day-to-day life is undeniable. RL models are beating world champions at their own game, and DL models can classify a multitude of categories, precisely identifying species, objects, and materials at a scale that is humanly impossible. The field has grown so much that deep language models are writing fiction, news reports, and poetry at the same time. While the developments in the field are exciting, the underlying techniques are probabilistic, making the performance of an ML model highly dependent on the training data.

Gaps in the training data have led to biased models: a biased model can have excellent performance on inputs close to the training data, but it fails on corner cases that were rarely seen during training. Whatever the domain, an ML model draws boundaries between classes in an n-dimensional space, and these decision boundaries cannot easily be reasoned about. Sometimes these boundaries are subtle, making the model categorize inputs in the immediate neighborhood of a boundary into different classes, often with high confidence scores. Adversarial attacks are crafted to take advantage of these traits and fool the system into wrongly interpreting an input.

A typical adversarial attack starts with studying the model, its inputs, outputs, training data, model parameters, etc., according to the threat model. While earlier threat models were built around the training data and complete access to the model parameters, this is less practical in the sense that the attacker is usually an outsider. A black-box adversarial attack, on the other hand, limits the attacker's capabilities to access to the deployed model only; sometimes restrictions are also placed on the observable outputs of the model. The setting of black-box attacks and the constraints on the attacker are closer to a real-world scenario.

To understand adversarial attacks in the black-box context, we introduce the relevant taxonomy in this section.

1.1. Taxonomy

Here we introduce the key concepts used in the context of adversarial attacks.

  • Adversary: An entity trying to make a machine learning model wrongly classify a perfectly legitimate-looking input. For an input x whose original class is c, the adversary's goal is to make the model predict some class c′ ≠ c for a carefully crafted input x′; the point to note is that x′ will still be classified as c by a human annotator.

  • Perturbation: To achieve the above goals, the adversary crafts an input x′ that is perturbed from x by adding an impurity, say δ. The modified input x′ = x + δ is called the perturbed input. Constraints on the perturbation are a challenge for the adversary: any change made should stay under a certain value, to make sure that the input is still classified into its original class by a human annotator. The Lp distance is often used in the literature to define the acceptable impurity.

  • Adversarial Input: The input crafted by the adversary after perturbation is called the adversarial input. Its key components are the actual input and the impurity. Adding the impurity makes the actual input "jump the classification boundary", hence the model misclassifies an adversarial input.

  • Targeted Attack: When the class into which a perturbed input should be misclassified is specified in advance, the attack is a targeted attack. Here, the adversary crafts the impurity in such a way that the model classifies the given input into a chosen target class t.

  • Query: A query is a single instance of sending an input to the model under attack and noting the observations made. Minimizing the number of queries reduces the time taken to build an adversarial example, so query efficiency is a key aspect of building effective adversary models.

  • Threat Model: The threat model defines the rules of the attack: what resources the adversary can use and the end goal of the attack. Its two main components are described below:

    • Attacker Goals: Goals define what the adversarial input seeks from the attack. Various security goals such as integrity attacks, availability attacks, targeted attacks, and exploratory attacks come under goals. An attacker can pursue a single goal or multiple goals at a given time.

    • Attacker Capabilities: The information at the adversary's disposal, such as the training data (with or without labels), the model parameters, and the number of queries the adversary can make, comes under the attacker's capabilities. Whether the attack takes place at training time or post-training is another aspect in which the attacker's capabilities are defined.

  • White-box Attack: In a white-box attack the adversary knows everything about the target model, including the learned weights and the parameters used to tune the model. Labelled training data is also available in some cases. With this information, the usual strategy of the attacker is to model the decision boundaries from the weights and derive perturbed inputs that cross those boundaries while staying within the perturbation limits.

  • Black-box Attack: In contrast to white-box attacks, a black-box attack assumes limited knowledge of the model, and sometimes no labelled data. An attack under black-box constraints is typically modeled around querying the model on inputs and observing the returned labels or confidence scores.

  • Transferability: This refers to training a "substitute model" on similar data by observing the inputs and outputs (querying) of the original "target model", and then using perturbed inputs crafted against the substitute to attack the target. The assumption here is that the substitute model simulates the target model.
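The notions of perturbation and adversarial input above can be illustrated concretely. The following is a minimal sketch, using NumPy only, of forming a perturbed input x′ = x + δ under an L-infinity budget; the image size, budget, and random impurity are all illustrative, not taken from any particular attack:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((28, 28))   # original input x (e.g., a normalized grayscale image)
eps = 0.05                 # L-infinity perturbation budget

# Impurity delta crafted by the adversary (here simply random, for illustration).
delta = rng.uniform(-eps, eps, x.shape)

# Perturbed input x' = x + delta, clipped back to the valid pixel range.
x_adv = np.clip(x + delta, 0.0, 1.0)

# Since |delta| <= eps everywhere, a human annotator would still assign the
# original class to x_adv; the hope of the adversary is that the model will not.
print(float(np.max(np.abs(x_adv - x))))
```

Because x lies in [0, 1], clipping can only shrink the perturbation, so the printed maximum deviation never exceeds the budget eps.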

1.2. Related Surveys, how our survey is different

In an early work (barreno2010security), the authors introduced the now-standard taxonomy of adversarial attacks in machine learning, and the survey gave structure to the interaction between attacker and defender. By fitting the existing adversarial attacks into this threat model, the authors generalized the principles involved in the security of machine learning. Another major contribution of the work is a line of defenses against adversarial attacks that the authors presented based on their taxonomy. The work (gilmer2018motivating) also generalised and standardised the work done in this domain.

As adversarial attacks became more frequent, later surveys in the field concentrated on the specifics of a domain. (akhtar2018threat) is an extensive survey of attacks on computer vision with deep learning, (liu2018survey) gave a data-driven view of the same, and (kumar2017survey) studied the fundamental security policy violations in adversarial attacks.

As mentioned in Section 1, black-box adversarial attacks are closer to real life, and a lot of work is being carried out in this domain. However extensive and informative the surveys above have been, there is a lack of a study exclusive to black-box attacks. With this work we survey the existing benchmark black-box attacks across different domains. We also address how domains like computer vision and NLP differ in terms of robustness towards attacks, and the challenges in approaching these models. Table 1 lists the works we discuss in this paper, covering the milestone contributions in the black-box adversarial space.

1.3. Why black box

While black-box attacks limit the capabilities of the attacker, they are more practical. Generally speaking, security attacks are carried out on fully grown, deployed systems. Attacks with the intent to circumvent a system, disable it, or compromise its integrity are seen in the real world.

This paradigm allows us to consider two parties: the challenger, who trains and deploys the model, and the adversary, the attacker whose intention is to break the system for a predefined goal. Multiple capability configurations that reflect real-world behavior are possible in this setting. For example, a robber fleeing a crime scene who wants to fool the surveillance system into tagging a different car number, or a criminal targeting a victim so that the victim's car number gets tagged, are changes in the configuration of adversarial user goals. In both cases, the system under attack is owned by a third party: neither the robber nor the criminal was part of the development of the surveillance system.

The limited capabilities of the adversary are a reflection of real-world scenarios; hence, works in this domain are more practical.

1.4. Evaluation of Attack

Adversarial attacks are generally evaluated based on the number of queries made to the model to converge on the attack parameters. The fewer the queries, the better the attack, since the time taken to mount an attack is minimized. Apart from the time taken to craft an attack, the perturbation norm is a standard measure of an attack's effectiveness. Borrowed from the machine learning literature, the most common perturbation norms in use are L2 and L∞, which are standard error measures.

  • L2 norm: Also called the Euclidean norm, the L2 norm is the straight-line distance between two vectors. In the adversarial setting, the distance between the original input x and the perturbed input x′ is calculated. By keeping this distance at a minimum, the goal is to change the input in a way that is indistinguishable to a human annotator.

  • L∞ norm: The L∞ norm is the largest absolute entry of a vector; applied to the perturbation x′ − x, it measures the maximum change made to any single coordinate.
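Both norms can be computed directly from the perturbation vector; a small sketch with illustrative numbers, assuming NumPy arrays for the original and perturbed inputs:

```python
import numpy as np

x = np.array([0.20, 0.50, 0.90])      # original input
x_adv = np.array([0.25, 0.50, 0.88])  # perturbed input

delta = x_adv - x

# L2 (Euclidean) norm: square root of the sum of squared entries.
l2 = np.linalg.norm(delta)

# L-infinity norm: largest absolute entry of the perturbation.
linf = np.linalg.norm(delta, np.inf)

# An attack is considered imperceptible if these stay under a chosen budget.
print(l2, linf)
```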

Month Year Method Proposed Approach
Feb 2016 Papernot et al.(papernot) Local substitute model
Dec 2016 Narodytska et al.(narodytska_2016) Gradient-free
Jul 2017 Narodytska et al.(narodytska_2017) Advanced Local Search
Nov 2017 Chen et al.(chen_2017) ZOO
Feb 2018 Brendel et al.(brendel) Boundary Attack
Jul 2018 Ilyas et al.(ilyas_2018) Limited Queries and Info
Jul 2018 Cheng et al.(cheng) Opt-Attack
Oct 2018 Bhagoji et al.(bhagoji) Query Reduction using Finite Differences
Oct 2018 Du et al.(du) Attack using Gray Images
Jan 2019 Tu et al.(tu) AutoZOOM
Apr 2019 Shi et al.(shi) Curls & Whey
Apr 2019 Dong et al.(dong) Translation-Invariant
May 2019 Chen et al.(chen_2019) POBA-GA
May 2019 Brunner et al.(brunner) Biased Sampling
May 2019 Li et al.(li) NATTACK
May 2019 Moon et al.(moon) Combinatorial Optimization
May 2019 Ilyas et al.(ilyas_2019) Bandits & Priors
Jul 2019 Alzantot et al.(alzantot) GenAttacks
Aug 2019 Guo et al.(guo) SimBA
Table 1. A comparison of works that have focused on black-box scenarios in an adversarial setting.

2. Black Box vs White Box

As discussed in the Taxonomy (Section 1.1), the resources available to an adversary differ considerably based on the type of attack model. In a white-box attack an adversary can have full access to the model right from the start of training, which means access to the training data, testing data, network architecture, parameters, and finally the learned weights of the model. The number of queries an adversary can make is also counted among resources. For example, in a one-shot adversarial attack the attacker has only a single shot at fooling the model; in such cases crafting the adversarial example does not involve a fine-tuning step.

When it comes to black-box attacks, the resources available to the adversary are considerably fewer. For one, the adversary has no access to the model during the training phase, nor does the adversary know the weights and parameters used in the model. Once the model is deployed, depending on the type of attack, the probability of each predicted label may be provided to the adversary. A stricter case is where only the predicted label is known to the attacker, without any confidence score. A varying degree of information, such as a labelled dataset, is provided to adversaries under certain threat models. Again, as noted above, the number of queries an attacker can make while crafting an adversarial example is counted among resources. An attack is considered superior if the amount of resources consumed is minimal.

3. Crafting a Black Box attack

Crafting a black-box adversarial example starts with evaluating the resources at the attacker's disposal. These include the kind of model, any model-related outputs such as confidence scores, the test set, the training dataset, etc. Once the resources are in place, an attack is carried out in one of the following ways. Using a transferable attack strategy, the adversary can choose to train a parallel model, called a substitute model, to emulate the original model; the attacker can even use a superior architecture to the original for the weight estimation. In the absence of a substitute model, the attacker chooses a query feedback mechanism, in which the attacker continuously refines the perturbed input while querying the model under attack.

As mentioned earlier, when a network's architecture and model weights are unknown, which is the case with black-box adversarial attacks, it is common practice for attackers to train another model from scratch, called a substitute model, with the goal of emulating the model under attack. Once the attacker's substitute model has achieved satisfactory accuracy, adversarial examples are crafted by examining its weights. Once the perturbed inputs start being misclassified by the model in the attacker's hands, these inputs are used on the actual model to achieve the adversarial goals. The substitute model is usually chosen to be of superior capacity so as to capture the latent space with better contrast. As the adversary has much control over and knowledge of the task at hand, this is the choice for one-shot and targeted adversarial attacks. To put it in a sentence: transferring the learned behavior through continuous querying, then strategically attacking the model, comes under transferable attacks.
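The substitute-model strategy above can be sketched end to end. In this toy sketch the target is a stand-in black box with hidden weights (the attacker observes only predicted labels), and the substitute is a simple logistic model fit on the query transcript; in practice the substitute would be a far more expressive network, and every name and number here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the deployed target model: weights are hidden from the attacker.
w_secret = rng.normal(size=4)

def query_target(batch):
    """One query per input: the attacker observes only the predicted label."""
    return (batch @ w_secret > 0).astype(float)

# 1. The attacker synthesizes (or collects) unlabeled inputs.
x_query = rng.normal(size=(1000, 4))

# 2. Label them by querying the black box.
y_query = query_target(x_query)

# 3. Train a substitute model (logistic regression via gradient descent)
#    on the queried input/label pairs.
w_sub = np.zeros(4)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(x_query @ w_sub)))          # substitute's predictions
    w_sub -= 0.1 * x_query.T @ (p - y_query) / len(y_query)  # logistic-loss gradient step

# White-box attacks are then run against the substitute, and the resulting
# adversarial inputs are transferred back to the target model.
agreement = float(((x_query @ w_sub > 0) == y_query.astype(bool)).mean())
print(f"substitute/target agreement on queried inputs: {agreement:.2f}")
```

The higher the agreement between substitute and target, the more likely adversarial inputs crafted on the substitute are to transfer.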

Adversarial attacks without a substitute model pose the challenge of identifying the boundaries and crafting the perturbed input manually. Rather than training a model and then examining its weights, this kind of attack is more hands-on and interactive. The attack takes a query feedback form, where the attacker starts with an input and adds noise to it while staying under the acceptable perturbation error level. As the confidence score of the input deteriorates, the attacker pushes the noise further in the same direction, much like following a gradient. In some cases gradient descent itself is used to mark the direction and track the movement of the noise. Also termed local search, the technique boils down to searching for the right dimension in the latent space to obtain a misclassified input. Attacks following the substitute model and the query feedback mechanism have been the pillars of black-box adversarial attacks. The sections that follow discuss both in detail; Table 4.0.1 lists the works, organized by domain, that we cover in this paper.
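A minimal query-feedback loop of this kind might look like the following sketch. The scoring function is a hypothetical stand-in for querying the deployed model's confidence in the true class, and the input size, budget, and step size are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def target_confidence(x):
    """Stand-in for one query to the deployed model: returns a toy logistic
    confidence score for the input's original class."""
    w = np.linspace(-1.0, 1.0, x.size)
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = rng.random(16)           # original input
x_adv = x.copy()
eps, step = 0.2, 0.05        # L-infinity budget and per-move step size
best = target_confidence(x)  # confidence of the true class so far
queries = 1

for _ in range(300):
    i = int(rng.integers(x.size))  # local search: perturb one random coordinate
    for sign in (1.0, -1.0):
        cand = x_adv.copy()
        # Move the coordinate, staying inside the L-infinity ball around x.
        cand[i] = np.clip(cand[i] + sign * step, x[i] - eps, x[i] + eps)
        score = target_confidence(cand)
        queries += 1
        if score < best:           # keep only moves that degrade confidence
            x_adv, best = cand, score
            break

print(queries, target_confidence(x), best)
```

Each accepted move nudges the noise in the direction that lowers the model's confidence, exactly the greedy behavior the paragraph describes; gradient-estimation attacks replace this trial-and-error with explicit finite-difference queries.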

4. Black Box Attacks in Computer Vision

This section provides a detailed discussion of recent and important works in the black-box adversarial attacks domain. Subsections are organized with the earliest works appearing first. For each work we discuss the dataset used, the key contributions, and the loss function used to calculate perturbations.

4.0.1. Popular Image data sets

The most frequently used datasets to demonstrate Adversarial attacks are as follows:

  • MNIST: A popular image dataset of black-and-white, handwritten digits. It has 10 classes, the digits 0 to 9. It is considered an entry-level dataset.

  • CIFAR-10: The dataset consists of 60,000 color images in 10 classes. It is larger and more diverse in style than MNIST.

  • ImageNet: ImageNet contains over 14 million images across 27 high-level categories and 21,841 sub-level categories. This dataset has played an important role in computer vision and is considered an advanced benchmark.

Classification of prior work on black-box scenario-based attacks:

  • Gradient Estimation: Chen et al.(chen_2017), Ilyas et al.(ilyas_2018), Cheng et al.(cheng), Bhagoji et al.(bhagoji), Du et al.(du), Tu et al.(tu), Ilyas et al.(ilyas_2019)

  • Transferability: Papernot et al.(papernot), Shi et al.(shi), Dong et al.(dong)

  • Local Search: Narodytska et al.(narodytska_2016), Narodytska et al.(narodytska_2017), Brendel et al.(brendel), Chen et al.(chen_2019), Brunner et al.(brunner), Li et al.(li), Alzantot et al.(alzantot), Guo et al.(guo)

  • Combinatorics: Moon et al.(moon)

4.1. Attack Techniques

We divide the proposed attack methodologies into four categories, as shown in the classification in Section 4.0.1.

4.1.1. Gradient Estimation

  • ZOO: Chen et al.(chen_2017) proposed Zeroth Order Optimization (ZOO) to estimate the gradients of the target DNN in order to produce an adversarial image. The threat model assumed by the authors is that the target model can only be queried to obtain the probability scores of all the classes. The loss function formulated, for a targeted attack on class t, is:

        f(x, t) = max{ max_{i≠t} log[F(x)]_i − log[F(x)]_t, −κ }

    The authors then use the symmetric difference quotient(cite) to estimate the gradient along coordinate i:

        ∂f(x)/∂x_i ≈ ( f(x + h·e_i) − f(x − h·e_i) ) / (2h)

    where h is a small constant and e_i is the i-th standard basis vector. The above naive solution requires querying the model 2d times per gradient estimate, where d is the dimension of the input. So the authors propose two stochastic coordinate methods, ZOO-Adam and ZOO-Newton, in which the gradient is estimated for a single randomly chosen coordinate and the update is obtained using ADAM(cite) or Newton's method(cite), repeated until convergence. The authors also discuss generating the noise in a lower dimension in order to improve efficiency, and specify its advantages and disadvantages.

  • Limited Queries & Information: The authors in (ilyas_2018) consider three primary settings in devising successful black-box adversarial attacks. The first addresses the constraint of a limited number of queries that the adversary can make to the model; the limit can be of two types, one imposed by time and the other by monetary cost. The authors present a variant of Natural Evolution Strategies (NES) coupled with Projected Gradient Descent (PGD), as used in white-box attacks, to construct adversarial examples.

    The second case is modelled around the constraint of the partial-information setting, where the adversary can obtain confidence scores only for the top k classes for a given input. These scores may not add up to 1, since the adversary does not have access to the probabilities for every possible classification label. To tackle this scenario, instead of beginning with the original input image x, it is recommended to begin with an instance belonging to the target adversarial class. Hence, after each step, we need to ensure two things:

    • The target adversarial class needs to remain in the top-k classes at all points in time when the input image is being perturbed.

    • The probability of the input image getting classified as the target class increases with every iteration of PGD.


    Lastly, in the third scenario the adversary is not given any confidence scores; rather, the adversary can only obtain the names of the classification labels for the given input. The authors define a discretised score for an adversarial example, computed with access only to the sorted labels, to quantitatively represent the adversarial nature of the input image at each step.

    The proposed approach was evaluated on the InceptionV3 network (78% accuracy) and also on the Google Cloud Vision (GCV) API, which presents a real-world scenario for performing an adversarial attack. A success rate of 90% and above is achieved across all three settings when the InceptionV3 network is attacked with the above-mentioned techniques.

  • Opt-Attack: Cheng et al.(cheng) devise a black-box adversarial attack in a much stricter hard-label setting. This means that querying the target model yields only the top predicted label, unlike other threat models where per-class probability scores or the top-k labels are available. The authors make the attack query-efficient by treating the problem as a real-valued, continuous optimization problem.