
Adversarial Robustness Toolbox v0.2.2

by Maria-Irina Nicolae, et al.

Adversarial examples have become an indisputable threat to the security of modern AI systems based on deep neural networks (DNNs). The Adversarial Robustness Toolbox (ART) is a Python library designed to support researchers and developers in creating novel defence techniques, as well as in deploying practical defences for real-world AI systems. Researchers can use ART to benchmark novel defences against the state-of-the-art. For developers, the library provides interfaces which support the composition of comprehensive defence systems using individual methods as building blocks. The Adversarial Robustness Toolbox supports machine learning models (and deep neural networks (DNNs) specifically) implemented in any of the most popular deep learning frameworks (TensorFlow, Keras, PyTorch). Currently, the library is primarily intended to improve the adversarial robustness of visual recognition systems; however, future releases that will comprise adaptations to other data modes (such as speech, text or time series) are envisioned. The ART source code is released under an MIT license. The release includes code examples and extensive documentation to help researchers and developers get started quickly.





1 Introduction

The Adversarial Robustness Toolbox (ART) is an open source Python library containing state-of-the-art adversarial attacks and defences. It has been released under an MIT license and is publicly available. It provides standardized interfaces for classifiers using any of the most popular deep learning frameworks (TensorFlow, Keras, PyTorch). The architecture of ART makes it easy to combine various defences, e.g. adversarial training with data preprocessing and runtime detection of adversarial inputs. ART is designed both for researchers who want to run large-scale experiments to benchmark novel attacks or defences, and for developers who want to compose and deploy comprehensive defences for real-world machine learning applications.

The purpose of this document is to provide the mathematical background and implementation details for the adversarial attacks and defences implemented in ART. It is complementary to the documentation hosted on Read the Docs. In particular, it fully explains the semantics and mathematical background of all attack and defence hyperparameters, and it highlights any custom choices in the implementation. As such, it provides a single reference for all the state-of-the-art attacks and defences implemented in ART, whereas previously researchers and developers often had to go through the details of the original papers and compare various implementations of the same algorithm.

This document is structured as follows: Section 2 provides background and introduces mathematical notation. Section 3 gives an overview of the ART architecture and library modules. The following sections cover the different modules in detail: Section 4 introduces the classifier modules, Section 5 the evasion attacks and Section 6 the evasion defences. Section 7.1 covers the detection of evasion attacks, Section 8 the detection of poisoning, and Section 9 the metrics. Finally, Section 10 describes the versioning system.

2 Background

While early work in machine learning has often assumed a closed and trusted environment, attacks against the machine learning process and the resulting models have received much attention in recent years. Adversarial machine learning is a field that aims to protect the machine learning pipeline and ensure its safety at training, test and inference time [29, 3, 6].

The threat of evasion attacks against machine learning models at test time was first highlighted by [4]. [32] investigated specifically the vulnerability of deep neural network (DNN) models and proposed an efficient algorithm for crafting adversarial examples for such models. Since then, there has been an explosion of work on proposing more advanced adversarial attacks, on understanding the phenomenon of adversarial examples, on assessing the robustness of specific DNN architectures and learning paradigms, and on proposing as well as evaluating various defence strategies against evasion attacks.

Generally, the objective of an evasion attack is to modify the input to a classifier such that it is misclassified, while keeping the modification as small as possible. An important distinction is between untargeted and targeted attacks: if untargeted, the attacker aims for a misclassification of the modified input without any constraints on what the new class should be; if targeted, the new class is specified by the attacker. Another important distinction is between black-box and white-box attacks: in the white-box case, the attacker has full access to the architecture and parameters of the classifier; for a black-box attack, this is not the case. A typical black-box strategy is to use a surrogate model for crafting the attacks, exploiting the transferability of adversarial examples that has been demonstrated among a variety of architectures (in the image classification domain, at least). Another way to approach the black-box threat model is through zero-order optimization: attacks in this category are able to produce adversarial samples without accessing model gradients at all (e.g. ZOO [8]), relying instead on zero-order approximations of the target model. One can also consider a wide range of grey-box settings in which the attacker may not have access to the classifier's parameters, but does have access to its architecture, training algorithm or training data.

On the adversarial defence side, two different strategies can be considered: model hardening and runtime detection of adversarial inputs. Among the model hardening methods, a widely explored approach is to augment the training data of the classifier, e.g. by adversarial examples (so-called adversarial training [11, 23]) or other augmentation methods. Another approach is input data preprocessing, often using non-differentiable or randomized transformation [13], transformations reducing the dimensionality of the inputs [36], or transformations aiming to project inputs onto the “true” data manifold [21]. Other model hardening approaches involve special types of regularization during model training [31], or modifying elements of the classifier’s architecture [37].

Note that robustness metrics are a key element to measure the vulnerability of a classifier with respect to particular attacks, and to assess the effectiveness of adversarial defences. Typically such metrics quantify the amount of perturbation that is required to cause a misclassification or, more generally, the sensitivity of model outputs with respect to changes in their inputs.

Poisoning attacks are another threat to machine learning systems, executed at data collection and training time. Machine learning systems often assume that the data used for training can be trusted and fully reflects the population of interest. However, data collection and curation processes are often not fully controlled by the owner or stakeholders of the model. For example, common data sources include social media, crowdsourcing, consumer behavior and Internet of Things measurements. This lack of control creates a threat of poisoning attacks, where adversaries have the opportunity to manipulate the training data in order to significantly decrease overall performance, cause targeted misclassification or other bad behavior, or insert backdoors and neural trojans [3, 27, 15, 12, 19, 20, 5, 26]. Defences against this threat aim to detect and filter malicious training data [3, 2, 29].

Mathematical notation

In the remainder of this section, we introduce the mathematical notation that will be used to explain the various attack and defence techniques in the following sections. Table 1 lists the key notation for quick reference.

Additionally, we will be using the following common shorthand notation for standard Python libraries in code examples:

  • np for NumPy

  • torch for PyTorch

  • tf for TensorFlow

  • k for Keras backend (keras.backend).

Notation                 Description
X                        Space of classifier inputs
x                        Classifier input
x_min                    Minimum clipping value for classifier inputs
x_max                    Maximum clipping value for classifier inputs
clip(x, x_min, x_max)    Function clipping x at x_min and x_max, respectively
‖x‖_p                    ℓ_p norm of x
project(x, p, ε)         Function returning x̃ with smallest norm ‖x − x̃‖₂ satisfying ‖x̃‖_p ≤ ε
Y                        Space of classifier outputs (= labels)
K                        Cardinality of Y (note: we assume Y = {0, …, K−1})
Z(x)                     Classifier logits (ranging in ℝ^K)
F(x)                     Class probabilities (F(x) = softmax(Z(x)))
C(x)                     Classification of input x (also used to denote the classifier itself)
ρ(x)                     Untargeted adversarial perturbation
ρ_y(x)                   Targeted adversarial perturbation
x_adv                    Adversarial sample
L(x, y)                  Loss function
∇_x L(x, y)              Loss gradients
∇_x Z(x)                 Logit gradients
∇_x F(x)                 Output gradients
Table 1: Summary of notation.

The notion of a classifier will be central in the following. By X we denote the space of classifier inputs. For most parts, we are concerned with classifier inputs that are images and hence assume X = [x_min, x_max]^(w×h×c), where w is the width, h the height, and c the number of color channels of the image (typically c = 1 or c = 3 depending on whether the image is greyscale or colored). We assume that the classifier inputs have minimum and maximum clipping values x_min and x_max, respectively, i.e. each component of x ∈ X lies within the interval [x_min, x_max]. For images, this interval is typically [0, 255] in the case of 8-bit pixel values, or [0, 1] if the pixel values have been normalized. For an arbitrary x, we use clip(x, x_min, x_max) to denote the input that is obtained by clipping each component of x at x_min and x_max, respectively. Moreover, we write ‖x‖_p for the ℓ_p norm of x, and project(x, p, ε) for the function which returns, among all x̃ satisfying ‖x̃‖_p ≤ ε, the one with smallest norm ‖x − x̃‖₂. In the special case p = 2, this is equivalent to multiplying x with the minimum of 1 and ε/‖x‖₂, and in the case p = ∞ it is equivalent to clipping each component of x at ±ε.
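The clipping and projection operations just defined can be sketched in NumPy (illustrative helper functions following the notation above, not part of the ART API; only p = 2 and p = ∞ are shown):

```python
import numpy as np

def clip(x, x_min, x_max):
    """Clip each component of x to the interval [x_min, x_max]."""
    return np.clip(x, x_min, x_max)

def project(x, p, eps):
    """Return, among all x' with ||x'||_p <= eps, the one closest to x
    in the l2 sense. Sketch for p = 2 and p = inf only."""
    if p == 2:
        # Scale x towards the origin if it lies outside the eps-ball.
        norm = np.linalg.norm(x)
        return x * min(1.0, eps / norm) if norm > 0 else x
    if p == np.inf:
        # Component-wise clipping at +/- eps.
        return np.clip(x, -eps, eps)
    raise NotImplementedError("only p = 2 and p = inf are sketched here")
```

For p = 2 this reproduces the multiplicative shrinking described above; for p = ∞ it reduces to component-wise clipping at ±ε.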

By Y we denote the space of classifier outputs, which we also refer to as labels. We always assume Y = {0, …, K−1}, i.e. there are K different classes. For most parts, we will consider classifiers based on a logit function Z: X → ℝ^K. We refer to Z(x) as the logits of the classifier for input x. The class probabilities are obtained by applying the softmax function, F(x) = softmax(Z(x)), i.e. F_i(x) = exp(Z_i(x)) / Σ_j exp(Z_j(x)) for i = 0, …, K−1. Finally, we write C(x) for the classification of the input x:

C(x) = argmax_{i ∈ Y} F_i(x),

which, since softmax is a monotonic transformation, is equal to argmax_{i ∈ Y} Z_i(x). With slight abuse of notation, we often use C to denote the classifier itself.

An untargeted adversarial attack is a (potentially stochastic) mapping ρ: X → X, aiming to change the output of the classifier, i.e. C(x + ρ(x)) ≠ C(x), while keeping the perturbation ρ(x) small with respect to a particular norm (most commonly, ℓ∞). Similarly, a targeted adversarial attack is a (potentially stochastic) mapping ρ_y: X → X, aiming to ensure the classifier outputs a specified class y ∈ Y, i.e. C(x + ρ_y(x)) = y, while keeping the perturbation small. We call x_adv = x + ρ_y(x) (or analogously x_adv = x + ρ(x)) the adversarial sample generated by the targeted attack ρ_y (untargeted attack ρ). In practice, we often consider clip(x + ρ_y(x), x_min, x_max) (analogously, clip(x + ρ(x), x_min, x_max)) to ensure the adversarial samples are in the valid data range.

Finally, the following objects play an important role in the generation of adversarial samples:

  • The loss function L(x, y) that was used to train the classifier. In many cases, this is the cross-entropy loss: L(x, y) = −log F_y(x).

  • The loss gradient ∇_x L(x, y), i.e. the gradient of the classifier's loss function with respect to x.

  • The class gradients, i.e. either the logit gradients ∇_x Z(x) or the output gradients ∇_x F(x) with respect to x.
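To make the notation concrete, here is a small NumPy sketch of the softmax, classification and cross-entropy definitions above (F, C and cross_entropy are stand-in functions mirroring the notation, not ART code):

```python
import numpy as np

def F(z):
    """Softmax: class probabilities from logits."""
    e = np.exp(z - z.max())          # shift logits for numerical stability
    return e / e.sum()

def C(z):
    """Classification: argmax of probabilities (equivalently, of logits)."""
    return int(np.argmax(z))

def cross_entropy(z, y):
    """Cross-entropy loss L(x, y) = -log F_y(x), expressed via the logits."""
    return -np.log(F(z)[y])

z = np.array([2.0, 1.0, 0.1])        # example logits Z(x)
probs = F(z)
assert np.isclose(probs.sum(), 1.0)  # probabilities sum to one
assert C(z) == 0                     # class 0 has the largest logit
```

Note that C can be computed from the logits directly, since softmax is monotonic.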

3 Library Modules

The library is structured as follows:

art/
  attacks/
  classifiers/
  defences/
  detection/
  poison_detection/

The following sections each address one module of the library. The description of each library module is organized as follows: we first introduce the general concept, mathematical notation and a formal definition. We then provide a functional description of the library module. Finally, we give examples of how to use the respective functionality.

4 Classifiers


This module contains the functional API allowing for the integration of classification models into the library. The API is framework-independent, and multiple machine learning backends are currently supported. The modularity of the library also allows for new frameworks to be incorporated with minimal effort. The following framework-specific classifier implementations are supported in the current version of ART:

  • art.classifiers.KerasClassifier: support for the Keras [10] backend (see Section 4.2)

  • art.classifiers.MXClassifier: support for MXNet [9] (detailed in Section 4.3)

  • art.classifiers.PyTorchClassifier: support for PyTorch [30] (Section 4.4)

  • art.classifiers.TFClassifier: implementation for TensorFlow [1] (see Section 4.5).

All these are extensions of the Classifier base class, described in the following.

4.1 The Classifier Base Class

art.classifiers.Classifier art/classifiers/

The Classifier abstract class provides access to the components and properties of a classifier that are required for adversarial attacks and defences. It abstracts away the actual framework in which the classifier is implemented (e.g. TensorFlow, PyTorch, etc.), and hence makes the modules for adversarial attacks and defences framework-independent. This allows for easy extension when adding support for a new framework in ART: one only has to extend the base Classifier, without needing to reimplement the other modules that make use of it.

The public interface of the Classifier grants access to the following properties:

  • channel_index: the index of the axis containing the colour channel in the data.

  • clip_values: the range of the data as a tuple (x_min, x_max).

  • input_shape: the shape of one input sample.

  • layer_names: a list with the names of the layers in the model, ordered from input towards output. Note: this list does not include the input and output layers, only the hidden ones. The correctness of this property is not guaranteed and depends on the extent to which this information can be extracted from the underlying model.

  • nb_classes: the number of output classes K.

Additionally, each class extending Classifier has to provide the following functions:

  • __init__(clip_values, channel_index, defences=None, preprocessing=(0, 1))

    Initializes the classifier with the given clip values and defences. The preprocessing tuple of the form (subtractor, divider) indicates two float values: the first is subtracted from the inputs, and the inputs are then divided by the second, as preprocessing operations. These will always be applied by the classifier before performing any operation on data. The default values (0, 1) correspond to no preprocessing being applied. The defences parameter is either a string or a list of strings selecting the defences to be applied from the following:

    • featsqueeze[1-8] for feature squeezing, where the digit at the end indicates the bit depth (see Section 6.3)

    • labsmooth for label smoothing (see Section 6.4)

    • smooth for spatial smoothing (see Section 6.5).

  • predict(x, logits=False) -> np.ndarray

    Returns predictions of the classifier for the given inputs. If logits is False, then the class probabilities are returned, otherwise the class logits (predictions before softmax). The shape of the returned array is (n, K), where n is the number of given samples and K the number of classes.

  • class_gradient(x, label=None, logits=False) -> np.ndarray

    Returns the gradients of the class probabilities or logits (depending on the value of the logits parameter), evaluated at the given inputs. Specifying a class label (one-hot encoded) will only compute the class gradients for the respective class; otherwise, gradients for all classes will be computed. The shape of the returned array is of the form (n, K) + input_shape when no label is specified, or (n, 1) + input_shape for a given label, where input_shape is the shape of one classifier input.

  • loss_gradient(x, y) -> np.ndarray

    Returns the loss gradient evaluated at the given inputs x and y. The labels are assumed to be one-hot encoded, i.e. the label y is encoded as the y-th standard basis vector e_y. The shape of the returned array is (n,) + input_shape, i.e. the same as that of x.

  • fit(x, y, batch_size=128, nb_epochs=20) -> None

    Fits the classifier to the given data, using the provided batch size and number of epochs. The labels are assumed to be one-hot encoded.

  • get_activations(x, layer) -> np.ndarray

    Computes and returns the values of the activations (outputs) of the specified layer index for the given data x. The layer index goes from 0 to the total number of internal layers minus one. The input and output layers are not considered in the total number of available layers.
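To make the shape contracts of predict and class_gradient concrete, the following toy stand-in mimics them for a linear softmax model (purely illustrative: ToyClassifier and its closed-form gradient are assumptions of this sketch, not an actual ART backend, and the real API also accepts a label argument):

```python
import numpy as np

class ToyClassifier:
    """Linear softmax model mimicking the shape contract of the Classifier
    API. Purely illustrative; ART's real classifiers wrap framework models."""

    def __init__(self, W, b):
        self.W, self.b = W, b            # W: (K, d), b: (K,)
        self.nb_classes = W.shape[0]
        self.input_shape = (W.shape[1],)

    def predict(self, x, logits=False):
        z = x @ self.W.T + self.b        # (n, K) logits
        if logits:
            return z
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def class_gradient(self, x, logits=True):
        n = x.shape[0]
        if logits:
            # For a linear model, dZ_k/dx = W_k for every sample:
            # returned shape is (n, K) + input_shape.
            return np.broadcast_to(self.W, (n,) + self.W.shape).copy()
        raise NotImplementedError

clf = ToyClassifier(W=np.ones((3, 4)), b=np.zeros(3))
x = np.zeros((2, 4))
assert clf.predict(x).shape == (2, 3)             # (n, K)
assert clf.class_gradient(x).shape == (2, 3, 4)   # (n, K) + input_shape
```

The point of the sketch is only the shapes: probabilities come back as (n, K), while class gradients add the input shape as trailing dimensions.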

4.2 Keras Implementation

art.classifiers.KerasClassifier art/classifiers/

This class provides support for Keras models. It differs from the base Classifier only in the signature of the constructor:

  • __init__(clip_values, model, use_logits=False, channel_index=3, defences=None, preprocessing=(0, 1), input_layer=0, output_layer=0):

    • clip_values, channel_index, defences and preprocessing correspond to the parameters of the base class. channel_index should be set to 3 when using the TensorFlow backend for Keras, and to 1 when using Theano.

    • model is the compiled Keras Model object.

    • use_logits should be set to True when the output of the model is the logits; otherwise, the outputs are assumed to be probabilities.

    • input_layer and output_layer are two integers that allow the integration of Keras models with multiple input and output layers into the library. They specify the indices of the layers to be considered as input and output of the classifier, respectively, when computing gradients. For models with only one input and one output layer, these values do not need to be specified.

4.3 MXNet Implementation

art.classifiers.MXClassifier art/classifiers/

This is the class supporting the integration of MXNet Gluon models. The constructor is as follows:

  • __init__(clip_values, model, input_shape, nb_classes, optimizer=None, ctx=None, channel_index=1, defences=None, preprocessing=(0, 1)), where:

    • clip_values, channel_index, defences and preprocessing correspond to the parameters of the base class.

    • model is the mxnet.gluon.Block object containing the model.

    • input_shape is the shape of one input.

    • nb_classes is the number of classes in the model.

    • optimizer is the mxnet.gluon.Trainer used to train the classifier. This parameter is only required if fitting will be done through the Classifier interface.

    • ctx is the device on which the model runs (CPU or GPU).

4.4 PyTorch Implementation

art.classifiers.PyTorchClassifier art/classifiers/

This class allows for the integration of PyTorch models. The signature of the constructor is as follows:

  • __init__(clip_values, model, loss, optimizer, input_shape, nb_classes,
    channel_index=1, defences=None, preprocessing=(0, 1))
    , where:

    • clip_values, channel_index, defences and preprocessing correspond to the parameters of the base class.

    • model is the torch.nn.Module object containing the model.

    • loss is the loss function for training.

    • optimizer is the optimizer used to train the classifier.

    • input_shape is the shape of one input.

    • nb_classes is the number of classes of the classifier.

4.5 TensorFlow Implementation

art.classifiers.TFClassifier art/classifiers/

The TFClassifier provides a wrapper around TensorFlow models, allowing them to be integrated under the Classifier API. Its __init__ function takes the following parameters:

  • __init__(clip_values, input_ph, logits, output_ph=None, train=None,
    loss=None, learning=None, sess=None, channel_index=3, defences=None, preprocessing=(0, 1))
    , where:

    • clip_values, channel_index, defences and preprocessing correspond to the parameters of the base class.

    • input_ph: The input placeholder of the model.

    • logits: The logits layer.

    • output_ph: The label placeholder.

    • train: The symbolic training objective.

    • loss: The symbolic loss function.

    • learning: The placeholder to indicate if the model is training.

    • sess: The TensorFlow session.

Note that many of these parameters are only required when training a TensorFlow classifier through ART. When using a ready-trained model, only clip_values, input_ph, output_ph, logits and sess are required.

5 Attacks


This module contains all the attack methods supported by the library. These require access to a Classifier object, which is the target of the attack. By using the framework-independent API to access the targeted model, the attack implementation becomes agnostic to the framework used for training the model. It is thus easy to implement new attacks in the library.

The following attacks are currently implemented in ART:

  • art.attacks.FastGradientMethod: Fast Gradient Sign Method (FGSM) [11], presented in Section 5.2

  • art.attacks.BasicIterativeMethod: Basic Iterative Method (BIM) [18], detailed in Section 5.3

  • art.attacks.SaliencyMapMethod: Jacobian Saliency Map Attack (JSMA) [28], see Section 5.4

  • art.attacks.CarliniL2Method: Carlini & Wagner attack [7], see Section 5.5

  • art.attacks.DeepFool: DeepFool [24], see Section 5.6

  • art.attacks.UniversalPerturbation: Universal Perturbation [25], see Section 5.7

  • art.attacks.NewtonFool: NewtonFool [16], see Section 5.8

  • art.attacks.VirtualAdversarialMethod: Virtual Adversarial Method [23], detailed in Section 5.9.

We now describe the base class behind all attack implementations.

5.1 The Attack Base Class

art.attacks.Attack art/attacks/

ART has an abstract class Attack in the art.attacks.attack module which implements a common interface for any of the particular attacks implemented in the library. The class has an attribute classifier, an instance of the Classifier class, which is the classifier that the attack aims at. Moreover, the class has the following public methods:

  • __init__(classifier)

    Initializes the attack with the given classifier.

  • generate(x, **kwargs) -> np.ndarray

    Applies the attack to the given input x, using any attack-specific parameters provided in the kwargs dictionary. The parameters provided in the dictionary are also set in the attack attributes. Returns the perturbed inputs in an np.ndarray which has the same shape as x.

  • set_params(**kwargs) -> bool

    Initializes attack-specific hyper-parameters provided in the kwargs dictionary; returns True if the hyper-parameters were valid and the initialization successful, False otherwise.

5.2 FGSM

art.attacks.FastGradientMethod art/attacks/

The Fast Gradient Sign Method (FGSM) [11] works both in targeted and untargeted settings, and aims at controlling either the ℓ_1, ℓ_2 or ℓ_∞ norm of the adversarial perturbation. In the targeted case and for the ℓ_∞ norm, the adversarial perturbation generated by the FGSM attack is given by

ρ_y(x) = −ε · sign(∇_x L(x, y)),

where ε > 0 is the attack strength and y is the target class specified by the attacker. The adversarial sample is given by

x_adv = clip(x + ρ_y(x), x_min, x_max).

Intuitively, the attack transforms the input to reduce the classifier's loss when classifying it as y. For the ℓ_p norms with p ∈ {1, 2}, the adversarial perturbation is calculated as

ρ_y(x) = −ε · ∇_x L(x, y) / ‖∇_x L(x, y)‖_p.

Note that "Sign" in the attack name "FGSM" refers to the specific calculation for the ℓ_∞ norm (for which the attack was originally introduced). Sometimes the modified attacks for the ℓ_1 and ℓ_2 norms are referred to as "Fast Gradient Methods" (FGM); nevertheless, we will also refer to them as FGSM for simplicity. The untargeted version of the FGSM attack is devised as

ρ(x) = ε · sign(∇_x L(x, C(x))),     (1)

i.e. the input is transformed to increase the classifier's loss when continuing to classify it as C(x).

ART also implements an extension of the FGSM attack, in which the minimum perturbation is determined for which C(x_adv) ≠ C(x). This modification takes as input two extra floating-point parameters: ε_step and ε_max. It sequentially performs the standard FGSM attack with strength ε = ε_step, 2·ε_step, … until either the attack is successful (and the resulting adversarial sample is returned), or ε > ε_max, in which case the attack has failed.

The main advantage of FGSM is that it is very efficient to compute: only one gradient evaluation is required, and the attack can be applied straightforwardly to a batch of inputs. This makes FGSM (or variants thereof) a popular choice for adversarial training (see Section 6.1), in which a large number of adversarial samples needs to be generated.

The strength of the FGSM attack depends on the choice of the parameter ε. If ε is too small, then C(x_adv) might not differ from C(x). On the other hand, the norm of the perturbation grows linearly with ε. When choosing ε, it is particularly important to consider the actual data range [x_min, x_max].
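As a concrete illustration, the untargeted ℓ∞ FGSM step can be sketched in NumPy for a toy linear softmax model (the model, its closed-form gradient and all function names here are assumptions of this sketch, not part of ART):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad(x, y, W, b):
    """Cross-entropy loss and its input gradient for a linear softmax
    model: grad_x L = W^T (F(x) - e_y)."""
    p = softmax(W @ x + b)
    e_y = np.eye(len(p))[y]
    return -np.log(p[y]), W.T @ (p - e_y)

def fgsm_untargeted(x, y, W, b, eps, x_min=0.0, x_max=1.0):
    """One untargeted l-inf FGSM step: move in the sign of the loss
    gradient, then clip back into the valid data range."""
    _, g = loss_and_grad(x, y, W, b)
    return np.clip(x + eps * np.sign(g), x_min, x_max)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x = rng.uniform(0.2, 0.8, size=5)
y = int(np.argmax(softmax(W @ x + b)))   # current prediction C(x)
x_adv = fgsm_untargeted(x, y, W, b, eps=0.1)
loss0, _ = loss_and_grad(x, y, W, b)
loss1, _ = loss_and_grad(x_adv, y, W, b)
assert loss1 >= loss0                    # the step increases the loss
assert np.max(np.abs(x_adv - x)) <= 0.1 + 1e-9
```

Since the cross-entropy of a linear softmax model is convex in x, a step along the sign of the gradient can only increase the loss here; for deep networks this holds only approximately.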

Implementation details

The FastGradientMethod class has the following attributes:

  • norm: The norm of the adversarial perturbation (must be either np.inf, 1 or 2).

  • eps: The attack strength ε (must be greater than 0).

  • targeted: Indicating whether the attack is targeted (True) or untargeted (False).

The functions in the class take the following form:

  • __init__(classifier, norm=np.inf, eps=0.3, targeted=False)

    Initializes an FGSM attack instance.

  • generate(x, **kwargs) -> np.ndarray

    Applies the attack to the given input x. The function accepts the same parameters as __init__, but has two additional parameters:

    • y: An np.ndarray containing labels for the inputs x in one-hot encoding. If the attack is targeted, y is required and specifies the target classes. If the attack is untargeted, y is optional and overwrites the default argument C(x) in (1). Note that it is not advisable to provide the true labels in the untargeted case, as this may lead to the so-called label leaking effect [18].

    • minimal: True if the minimal perturbation should be computed. In that case, also eps_step for the step size and eps_max for the maximum perturbation should be provided (default values are 0.1 and 1.0, respectively).

5.3 Basic Iterative Method

art.attacks.BasicIterativeMethod art/attacks/

The Basic Iterative Method (BIM) [18] is a straightforward extension of FGSM that applies the attack multiple times, iteratively. This attack differs from FGSM in that it is targeted towards the least likely class for a given sample, i.e. the class for which the model outputs the lowest score. Like FGSM, BIM is limited by a total attack budget ε, and it has an additional parameter ε_step determining the step size at each iteration. At each step, the result of the attack is projected back onto the ball of size ε centered around the original input. The attack was originally designed for ℓ_∞ norm perturbations, but can easily be extended to other norms.

Implementation details

The BasicIterativeMethod class has the following specific parameters:

  • norm: The norm of the adversarial perturbation (must be either np.inf, 1 or 2).

  • eps: The attack strength ε (must be greater than 0).

  • eps_step: The step size to be taken at each iteration.

The functions in the class follow the Attack API. The previous parameters can be set either in __init__ or when calling the generate function.
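The iterate-and-project scheme described above can be sketched in NumPy (grad_fn, the toy constant gradient and all names are assumptions of this sketch, not the ART implementation):

```python
import numpy as np

def bim(x, grad_fn, eps, eps_step, n_iter=10, x_min=0.0, x_max=1.0):
    """Basic Iterative Method sketch (untargeted, l-inf): repeat small
    FGSM-style steps and project back onto the eps-ball around the
    original input. grad_fn(x) must return the loss gradient at x."""
    x_adv = x.copy()
    for _ in range(n_iter):
        x_adv = x_adv + eps_step * np.sign(grad_fn(x_adv))
        # Project onto the l-inf eps-ball centred at x, then clip into
        # the valid data range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, x_min, x_max)
    return x_adv

# Toy gradient: constant direction, so iterates march until the
# projection onto the eps-ball stops them.
x = np.full(4, 0.5)
x_adv = bim(x, grad_fn=lambda z: np.ones_like(z), eps=0.2, eps_step=0.05)
assert np.max(np.abs(x_adv - x)) <= 0.2 + 1e-9   # stays within the budget
```

With 10 steps of size 0.05 the unconstrained iterate would move by 0.5, but the projection caps the total perturbation at ε = 0.2.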

5.4 JSMA

art.attacks.SaliencyMapMethod art/attacks/

The Jacobian-based Saliency Map Attack (JSMA) [28] is a targeted attack which aims at controlling the ℓ_0 norm, i.e. the number of components of x that are being modified when crafting the adversarial sample x_adv. The attack iteratively modifies individual components of x until either the targeted misclassification is achieved or the total number of modified components exceeds a specified budget.

Details are outlined in Algorithm 1. To simplify notation, we let n denote the total number of components of x and refer to individual components using subscripts: x_i for i = 1, …, n. The key step is the computation of the saliency map (line 4), which is outlined in Algorithm 2. Essentially, the saliency map determines the components i, j of x to be modified next based on how much this would increase the probability of the target class y (captured in the sum of partial derivatives α = ∂F_y(x)/∂x_i + ∂F_y(x)/∂x_j), and decrease the sum of probabilities over all other classes (captured in the sum of partial derivatives β = Σ_{k≠y} (∂F_k(x)/∂x_i + ∂F_k(x)/∂x_j)). In particular, line 4 in Algorithm 2 ensures that α is positive, β is negative and −α·β is maximal over all pairs (i, j) in the search space. Considering pairs instead of individual components is motivated by the observation that this increases the chances of satisfying the conditions α > 0 and β < 0 ([28], page 9). In our implementation of the saliency map, we exploit that β can be expressed as

β = −α

(due to Σ_k F_k(x) = 1), and thus i, j are simply the two largest components of ∇_x F_y(x).

Line 9 in Algorithm 1 applies the perturbations to the components determined by the saliency map. The search space S keeps track of all the components that can still be modified (i.e. for which the clipping value has not been exceeded), and the set modified keeps track of all the components that have been modified at least once. The algorithm terminates if the attack has succeeded (i.e. C(x_adv) = y), the number of modified components has exhausted the budget (i.e. |modified| > γ·n), fewer than 2 components can still be modified (i.e. |S| < 2), or the saliency map returns the saliency score 0 (indicating it hasn't succeeded in determining components satisfying the conditions in line 4 of Algorithm 2).

Input: x: input to be adversarially perturbed; y: target label; θ: amount of perturbation per step and feature (assumed to be positive; see discussion in text); γ: maximum fraction of features to be perturbed (between 0 and 1).
1:  x_adv ← x, modified ← ∅
2:  S ← {i ∈ {1, …, n} : (x_adv)_i < x_max}
3:  while C(x_adv) ≠ y and |modified| ≤ γ·n do
4:     s, i, j ← saliency_map(x_adv, y, S)
5:     modified ← modified ∪ {i, j}
6:     if s = 0 or |S| < 2 then
7:        break
8:     end if
9:     (x_adv)_i ← min((x_adv)_i + θ, x_max), (x_adv)_j ← min((x_adv)_j + θ, x_max)
10:    S ← S \ {k ∈ {i, j} : (x_adv)_k ≥ x_max}
11: end while
12: Output: Adversarial sample x_adv.
Algorithm 1 JSMA method

Input: x: input to be adversarially perturbed; y: target label; S: set of indices of x to be explored.
1:  s ← 0, p ← 0, q ← 0
2:  for each subset {i, j} ⊆ S with i ≠ j do
3:     Compute α = ∂F_y(x)/∂x_i + ∂F_y(x)/∂x_j and β = Σ_{k≠y} (∂F_k(x)/∂x_i + ∂F_k(x)/∂x_j)
4:     if α > 0 and β < 0 and −α·β > s then
5:        s ← −α·β, p ← i, q ← j
6:     end if
7:  end for
Output: Maximum saliency score s and selected components p, q.
Algorithm 2 JSMA saliency_map

Two final comments on the JSMA method:

  • Algorithms 1 and 2 describe the method for positive values of the input parameter θ, resulting in perturbations where the components of x are increased while crafting x_adv. In principle, it is possible to also use negative values for θ. In fact, the experiments in [28] (Section IV D) suggest that using a negative θ results in perturbations that are harder to detect by the human eye (which, however, might be specific to the MNIST data set that was used in those experiments). The implementation in ART supports both positive and negative θ; however, note that for negative θ the following changes apply:

    • In Algorithm 1, line 2: change the condition to (x_adv)_i > x_min.

    • In Algorithm 1, line 10: change the condition to (x_adv)_k ≤ x_min.

    • In Algorithm 2, line 4: change the conditions α > 0 and β < 0 to α < 0 and β > 0, respectively.

  • While Algorithm 2 outlines the computation of the saliency map based on the partial derivatives of the classifier probabilities F_k, one can, in principle, consider the partial derivatives of the classifier logits Z_k instead. As discussed in [7], it appears that both variants have been implemented and used in practice. The current implementation in ART uses the partial derivatives of the classifier probabilities which, as explained above, can be computed particularly efficiently.
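The shortcut discussed above (selecting the pair via the two largest components of the target-class probability gradient, since β = −α) can be sketched in NumPy; this is an illustrative sketch for the positive-θ case, with hypothetical names, not ART's implementation:

```python
import numpy as np

def saliency_pair(grad_Fy, search_space):
    """Select the pair of components to perturb next. Because the class
    probabilities sum to 1, beta = -alpha, so the best pair (i, j) are
    simply the two largest components of grad F_y within the search
    space, and the saliency score -alpha*beta equals alpha^2."""
    idx = sorted(search_space)
    vals = grad_Fy[idx]
    order = np.argsort(vals)[::-1]           # indices sorted descending
    i, j = idx[order[0]], idx[order[1]]
    alpha = grad_Fy[i] + grad_Fy[j]
    if alpha <= 0:                           # conditions of Algorithm 2 fail
        return 0.0, None, None
    return alpha * alpha, i, j

g = np.array([0.1, 0.5, -0.2, 0.3])          # example gradient of F_y
score, i, j = saliency_pair(g, search_space={0, 1, 2, 3})
assert (i, j) == (1, 3)                      # two largest gradient components
```

This avoids the quadratic loop over all pairs in Algorithm 2: a single sort of the gradient components suffices.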

Implementation details

The SaliencyMapMethod class has the following attributes:

  • theta: Amount of perturbation θ per step and component (can be positive or negative).

  • gamma: Maximum fraction γ of components to be perturbed (must be greater than 0 and at most 1).

The functions in the class take the following form:

  • __init__(classifier, theta=0.1, gamma=1)

    Initializes a JSMA attack instance.

  • generate(x, **kwargs) -> np.ndarray

    Applies the attack to the given input x. The function accepts the same parameters as __init__, but has one additional parameter:

    • y, which is an np.ndarray containing the target labels, one-hot encoded. If not provided, target labels will be sampled uniformly at random.
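To make the pair-selection logic of the saliency map concrete, here is a minimal NumPy sketch (not the ART implementation; the function name and array layout are our own) which, given the Jacobian of the class probabilities, selects the pair of components with the maximal saliency score as in Algorithm 2:

```python
import numpy as np
from itertools import combinations

def saliency_map(jacobian, target, search_space):
    """Select the pair of components with maximal saliency score.

    jacobian: array of shape (num_classes, num_features) with dF_k/dx_i.
    target: index of the target class y.
    search_space: iterable of feature indices that may still be modified.
    """
    best_score, best_pair = 0.0, None
    for i, j in combinations(search_space, 2):
        # alpha: effect of increasing x_i and x_j on the target probability
        alpha = jacobian[target, i] + jacobian[target, j]
        # beta: combined effect on the probabilities of all other classes
        beta = (jacobian[:, i].sum() + jacobian[:, j].sum()) - alpha
        # keep the pair only if it helps the target and hurts the others
        if alpha > 0 and beta < 0 and -alpha * beta > best_score:
            best_score, best_pair = -alpha * beta, (i, j)
    return best_score, best_pair
```

For an input with n modifiable components, this brute-force search considers O(n²) pairs, which is why the efficiency of the underlying derivative computation matters.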

5.5 Carlini & Wagner attack

art.attacks.CarliniL2Method art/attacks/

The Carlini & Wagner (C&W) attack [7] is a targeted attack which aims to minimize the L₂ norm of adversarial perturbations. (L₀ and L∞ versions of the C&W attack also exist; they build upon the L₂ attack.) Below we also discuss how to perform untargeted C&W attacks. For a fixed input x, a target label y and a confidence parameter κ ≥ 0, consider the objective function

O_c(x_adv) = ‖x_adv − x‖₂² + c · ℓ(x_adv),   (2)

where

ℓ(x_adv) = max( max_{k≠y} Z_k(x_adv) − Z_y(x_adv) + κ, 0 ).   (3)

Note that ℓ(x_adv) = 0 if and only if ŷ(x_adv) = y and the logit Z_y(x_adv) exceeds any other logit by at least κ; thus κ relates to the classifier's confidence in the output y. The C&W attack aims at finding the smallest constant c for which the x_adv minimizing O_c(x_adv) is such that ℓ(x_adv) = 0. This can be regarded as the optimal trade-off between achieving the adversarial target and keeping the adversarial perturbation as small as possible. In the ART implementation, binary search is used for finding such c.

Details are outlined in Algorithm 3. Note that, in lines 1-2, the components of x are mapped from [clip_min, clip_max] onto ℝ via the arctanh function, which is the space in which the adversarial sample is created. Working in the arctanh space avoids the need for clipping during the optimization. The output is transformed back to the range [clip_min, clip_max] in lines 15-16. Algorithm 3 relies on two helper functions:

  • minimize_objective (line 7): this function aims at minimizing the objective function (2) for the given value of c. In the ART implementation, the minimization is performed using Stochastic Gradient Descent (SGD) with decayed learning rate; the minimization is aborted if a specified number of iterations is exceeded. (The original implementation in [7] used the Adam optimizer for the minimization problem and implemented an additional stopping criterion.) During the minimization, among all samples for which ℓ(x_adv) = 0, the one with the smallest L₂ norm of the perturbation is retained and returned at the end of the minimization.

  • update (line 12): this function updates the parameters c, c_lower and the doubling flag used in the binary search. Specifically, if an adversarial sample with ℓ(x_adv) = 0 was found for the previous value of c, then c_lower remains unchanged, c is set to the midpoint (c_lower + c)/2 and the doubling flag is set to False. If ℓ(x_adv) > 0, then c_lower is set to c, and c is either doubled (if the doubling flag is True) or set to the midpoint between c_lower and the smallest value of c for which an adversarial sample was found (if the flag is False). (This is a slight simplification of the original implementation in [7].) The binary search is abandoned if c exceeds the upper bound c_max (line 6).
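The binary-search logic can be illustrated with a simplified, self-contained variant (this is not the exact ART update rule; the explicit upper bound c_upper and the flag no_success_yet are our own simplification):

```python
def update(c, c_lower, c_upper, success, no_success_yet):
    """One step of a simplified binary search over the C&W constant c.

    success: whether minimizing the objective with the current c yielded
    an adversarial sample (i.e. the loss term l(x_adv) reached zero).
    """
    if success:
        # c was large enough: try a smaller constant to shrink the perturbation
        c_upper = min(c_upper, c)
        no_success_yet = False
    else:
        # c was too small: raise the lower bound
        c_lower = max(c_lower, c)
    if no_success_yet:
        c = 2 * c                      # keep doubling until the first success
    else:
        c = (c_lower + c_upper) / 2    # then bisect between the bounds
    return c, c_lower, c_upper, no_success_yet
```

Starting from a small initial constant, the search doubles c until an adversarial sample is found, then bisects to locate the smallest c that still succeeds.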

The untargeted version of the C&W attack aims at changing the original classification, i.e. constructing an adversarial sample x_adv with the only constraint that ŷ(x_adv) ≠ y₀. The only difference in the algorithm is that the objective function (3) is modified as follows:

ℓ(x_adv) = max( Z_{y₀}(x_adv) − max_{k≠y₀} Z_k(x_adv) + κ, 0 ),

where y₀ is the original label of x (or the prediction thereof). Thus, ℓ(x_adv) = 0 if and only if there exists a label k ≠ y₀ such that the logit Z_k(x_adv) exceeds the logit Z_{y₀}(x_adv) by at least κ.

0:  Require: x: Input to be adversarially perturbed; y: Target label; ε: Constant to avoid over-/underflow in (arc)tanh computations; c_init: Initialization of the binary search constant; c_max: Upper bound for the binary search constant; b_max: Number of binary search steps to be performed
1:  x̃ ← (x − clip_min) / (clip_max − clip_min)
2:  x̃ ← arctanh((2 · x̃ − 1) · (1 − ε))
3:  x̃_adv ← x̃
4:  c ← c_init, c_lower ← 0, b ← 0
5:  found ← False
6:  while b < b_max and c < c_max do
7:     x̃_new ← minimize_objective(x̃, y, c)
8:     if ℓ(x̃_new) = 0 and (not found or ‖x̃_new − x̃‖₂ < ‖x̃_adv − x̃‖₂) then
9:        x̃_adv ← x̃_new
10:       found ← True
11:    end if
12:    Update c, c_lower and the doubling flag via update(·)
13:    b ← b + 1
14: end while
15: x_adv ← (tanh(x̃_adv) / (1 − ε) + 1) / 2
16: x_adv ← x_adv · (clip_max − clip_min) + clip_min
Output: Adversarial sample x_adv.
Algorithm 3 Carlini & Wagner’s attack (targeted)

Implementation details

The C&W attack is implemented in the CarliniL2Method class, with the following attributes:

  • confidence: The confidence value κ to be used in Objective (3).

  • targeted: Specifies whether the attack should be targeted (True) or not (False).

  • learning_rate: SGD learning rate to be used in minimize_objective (line 7).

  • decay: SGD decay factor to be used in minimize_objective .

  • max_iter: Maximum number of SGD iterations to be used in minimize_objective.

  • binary_search_steps: Number of binary search steps.

  • initial_const: Initial value c_init of the binary search constant c.

The signatures of the functions are:

  • __init__(classifier, confidence=5, targeted=True, learning_rate=1e-4, binary_search_steps=25, max_iter=1000, initial_const=1e-4, decay=0)

    Initialize a C&W attack with the given parameters.

  • generate(x, **kwargs) -> np.ndarray

    Applies the attack to the given input x. The function accepts the same parameters as __init__, and an additional parameter:

    • y, which is an np.ndarray containing labels (one-hot encoded). In the case of a targeted attack, y is required and specifies the target labels. In the case of an untargeted attack, y is optional (if not provided, the current predictions of the classifier will be used); as commented earlier for the FGSM attack, it is advised not to provide the true labels for untargeted attacks in order to avoid label leaking.

    If binary_search_steps is large, then the algorithm is not very sensitive to the value of initial_const. The default value initial_const=1e-4 is suggested in [7]. Note that the values ε and c_max in Algorithm 3 are hardcoded to the same values used by the authors of the method.

5.6 DeepFool

art.attacks.DeepFool art/attacks/

DeepFool [24] is an untargeted attack which aims, for a given input x, to find the nearest decision boundary in the L₂ norm. (While [24] also explains how to adapt the algorithm to minimize the L₁ or L∞ norms of adversarial perturbations by modifying the expressions in (5)-(6), this is currently not implemented in ART.) Implementation details are provided in Algorithm 4. The basic idea is to project the input onto the nearest decision boundary; since the decision boundaries are non-linear, this is done iteratively. DeepFool often results in adversarial samples that lie exactly on a decision boundary; in order to push the samples over the boundaries and thus change their classification, the final adversarial perturbation is multiplied by a factor 1 + ε (see line 8).
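The projection idea has a closed form for an affine binary classifier f(x) = w·x + b, where the nearest point on the decision boundary is an orthogonal projection. The toy helper below (our own sketch, not the ART implementation) applies that projection with the overshoot factor:

```python
import numpy as np

def deepfool_affine(x, w, b, overshoot=0.02):
    """Minimal L2 perturbation pushing x across the hyperplane w.x + b = 0."""
    f = np.dot(w, x) + b
    # Orthogonal projection onto the boundary: r = -f(x) * w / ||w||^2
    r = -f * w / np.dot(w, w)
    # Overshoot pushes the sample slightly past the boundary
    return x + (1 + overshoot) * r
```

For non-linear classifiers, DeepFool linearizes the decision boundaries around the current iterate and applies this projection repeatedly.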

0:  Require: x: Input to be adversarially perturbed; n_max: Maximum number of projection steps; ε: Overshoot parameter (must be ≥ 0)
1:  x_adv ← x, i ← 0
2:  y₀ ← ŷ(x)
3:  while ŷ(x_adv) = y₀ and i < n_max do
4:     Compute f_k ← Z_k(x_adv) − Z_{y₀}(x_adv) and w_k ← ∇Z_k(x_adv) − ∇Z_{y₀}(x_adv) for all k ≠ y₀
5:     l ← argmin_{k≠y₀} |f_k| / ‖w_k‖₂
6:     x_adv ← x_adv + (|f_l| / ‖w_l‖₂²) · w_l, i ← i + 1
7:  end while
8:  Output: Adversarial sample x + (1 + ε) · (x_adv − x).
Algorithm 4 DeepFool attack

Implementation details

The DeepFool class has two attributes:

  • max_iter: Maximum number of projection steps to be undertaken.

  • overshoot: Overshoot parameter ε (must be non-negative).

The class functions implement the Attack API in the following:

  • __init__(classifier, max_iter=100)

    Initialize a DeepFool attack with the given parameters.

  • generate(x, **kwargs) -> np.ndarray

    Apply the attack to x. The same parameters as for __init__ can be specified at this point.

5.7 Universal Adversarial Perturbations

art.attacks.UniversalPerturbation art/attacks/

Universal adversarial perturbations [25] are a special type of untargeted attack, aiming to create a single constant perturbation ρ that successfully alters the classification of a specified fraction of inputs. The universal perturbation is crafted using a given untargeted attack A. Essentially, as long as the target fooling rate has not been achieved and the maximum number of iterations has not been reached, the algorithm iteratively adjusts the universal perturbation by adding refinements which help to perturb additional samples from the input set; after each iteration, the universal perturbation is projected into the ball of radius ε with respect to the chosen norm in order to control the attack strength. Details are provided in Algorithm 5.
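The projection step that controls the attack strength can be sketched as follows for the L₂ and L∞ norms (a hedged illustration; the helper name project is our own, not an ART function):

```python
import numpy as np

def project(v, norm, eps):
    """Project the perturbation v into the eps-ball of the given norm.

    norm: 2 for the L2 ball, np.inf for the L-infinity ball.
    """
    if norm == np.inf:
        # L-inf ball: clip each component to [-eps, eps]
        return np.clip(v, -eps, eps)
    # L2 ball: rescale only if the perturbation exceeds the radius
    n = np.linalg.norm(v)
    return v if n <= eps else v * (eps / n)
```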

0:  Require: X: Set of inputs to be used for constructing the universal adversarial perturbation; A: Adversarial attack to be used; δ: Attack failure tolerance (1 − δ is the target fooling rate); ε: Attack step size; p: Norm of the adversarial perturbation; i_max: Maximum number of iterations
1:  ρ ← 0, i ← 0
2:  Compute the fooling rate r of ρ on X
3:  while r < 1 − δ and i < i_max do
4:     for x ∈ X in random order do
5:        if ŷ(x + ρ) = ŷ(x) then
6:           x_adv ← A(x + ρ)
7:           if ŷ(x_adv) ≠ ŷ(x) then
8:              ρ ← project_p(ρ + x_adv − (x + ρ), ε)
9:           end if
10:       end if
11:    end for
12:    Recompute the fooling rate r of ρ on X, i ← i + 1
13: end while
14: Output: Adversarial samples x + ρ for x ∈ X.
Algorithm 5 Universal Adversarial Perturbation

Implementation details

The UniversalPerturbation class has the following attributes:

  • attacker: A string representing which attack should be used in Algorithm 5. The following attacks are supported: carlini (Carlini & Wagner attack), deepfool (DeepFool), fgsm (Fast Gradient Sign Method), jsma (JSMA), newtonfool (NewtonFool) and vat (Virtual Adversarial Method).

  • attacker_params: A dictionary with attack-specific parameters that will be passed onto the selected attacker. If this parameter is not specified, the default values for the chosen attack are used.

  • delta: Attack failure tolerance δ (must lie between 0 and 1).

  • max_iter: Maximum number of iterations.

  • eps: Attack step size ε.

  • norm: Norm of the adversarial perturbation (must be either np.inf, 1 or 2).

The class functions implement the Attack API as follows:

  • __init__(classifier, attacker=’deepfool’, attacker_params=None, delta=0.2, max_iter=20, eps=10.0, norm=np.inf)

    Initialize a Universal Perturbation attack with the given parameters.

  • generate(x, **kwargs) -> np.ndarray

    Apply the attack to x. The same parameters as for __init__ can be specified at this point.

5.8 NewtonFool

art.attacks.NewtonFool art/attacks/

NewtonFool [16] is an untargeted attack that tries to decrease the probability F_{y₀}(x) of the original class y₀ by performing gradient descent. The step size δ is determined adaptively in Equation (7): when F_{y₀}(x) is larger than 1/K (recall K is the number of classes), which is the case as long as ŷ(x) = y₀, the second term will dominate; once F_{y₀}(x) approaches or falls below 1/K, the first term will dominate. The tuning parameter η determines how aggressively the gradient descent attempts to minimize the probability of class y₀. The method is described in detail in Algorithm 6.

0:  Require: x: Input to be adversarially perturbed; η: Strength of adversarial perturbations; i_max: Maximum number of iterations
1:  x_adv ← x, y₀ ← ŷ(x), i ← 0
2:  while i < i_max do
3:     Compute d ← ∇F_{y₀}(x_adv) and the step size δ according to Equation (7)
4:     x_adv ← x_adv − δ · d / ‖d‖₂²
5:     i ← i + 1
6:  end while
7:  Output: Adversarial sample x_adv.
Algorithm 6 NewtonFool attack

Implementation details

The NewtonFool class has the following attributes:

  • eta: Strength η of adversarial perturbations.

  • max_iter: Maximum number of iterations.

The class functions implement the Attack API as follows:

  • __init__(classifier, max_iter=100, eta=0.01)

    Initialize NewtonFool with the given parameters.

  • generate(x, **kwargs) -> np.ndarray

    Apply the attack to x. The same parameters as for __init__ can be specified at this point.

5.9 Virtual Adversarial Method

art.attacks.VirtualAdversarialMethod art/attacks/

The Virtual Adversarial Method [22] is not intended to create adversarial samples resulting in misclassification, but rather samples that, if included in the training set for adversarial training, result in local distributional smoothness of the trained model. We use n to denote the total number of components of classifier inputs and assume, for sake of convenience, that inputs are represented as n-dimensional vectors. By d we denote a random sample of the n-dimensional standard normal distribution, and by e_j the j-th standard basis vector of dimension n. The key idea behind the algorithm is to construct a perturbation d with unit norm maximizing the Kullback-Leibler (KL) divergence between the classifier outputs F(x) and F(x + d). In Algorithm 7 this is done iteratively via gradient ascent along finite differences (line 9). The final adversarial example is constructed by adding ε · d to the original input x (line 14), where ε is the perturbation strength parameter provided by the user.
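The finite-difference gradient ascent can be sketched on a toy linear-softmax classifier as follows (our own self-contained illustration, not the ART implementation; the toy model and function names are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """Kullback-Leibler divergence KL[p || q] for strictly positive p, q."""
    return float(np.sum(p * np.log(p / q)))

def vat_direction(predict, x, finite_diff=1e-6, max_iter=1, seed=0):
    """Gradient ascent on KL[predict(x) || predict(x + d)] via finite differences."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)                      # start from a random unit vector
    p = predict(x)
    for _ in range(max_iter):
        base = kl(p, predict(x + d))
        grad = np.zeros_like(x)
        for j in range(x.size):                 # finite difference along each basis vector
            e_j = np.zeros_like(x)
            e_j[j] = finite_diff
            grad[j] = (kl(p, predict(x + d + e_j)) - base) / finite_diff
        d = grad / np.linalg.norm(grad)         # renormalize to keep ||d|| = 1
    return d

# Toy classifier: softmax over a fixed linear map
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
predict = lambda x: softmax(W @ x)
d = vat_direction(predict, np.array([0.1, 0.2]))
```

The returned direction has unit norm; the final sample would be x + ε · d for a user-chosen strength ε.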

0:  Require: x: Input to be adversarially perturbed; ε: Perturbation strength; ξ: Finite differences width; i_max: Maximum number of iterations
1:  Sample d from the n-dimensional standard normal distribution
2:  d ← d / ‖d‖₂, i ← 0
3:  p ← F(x)
4:  while i < i_max do
5:     κ ← KL[p ‖ F(x + d)]
6:     d_new ← 0
7:     for j = 1, …, n do
8:        κ_j ← KL[p ‖ F(x + d + ξ · e_j)]
9:        d_new,j ← (κ_j − κ) / ξ
10:    end for
11:    d ← d_new / ‖d_new‖₂
12:    i ← i + 1
13: end while
14: Output: Adversarial sample x + ε · d.
Algorithm 7 Virtual Adversarial Method with finite differences

Implementation details

The VirtualAdversarialMethod class has the following attributes:

  • eps: Strength ε of adversarial perturbations.

  • finite_diff: Finite differences width ξ.

  • max_iter: Maximum number of iterations.

The following functions are accessible to users:

  • __init__(classifier, max_iter=1, finite_diff=1e-6, eps=0.1)

    Initialize an instance of the virtual adversarial perturbation generator for Virtual Adversarial Training.

  • generate(x, **kwargs) -> np.ndarray

    Compute the perturbation on x. The kwargs dictionary allows the user to overwrite any of the attack parameters, which will be set in the class attributes. Returns an array containing the adversarially perturbed sample(s).

6 Defences


There is a plethora of methods for defending against evasion attacks which can roughly be categorized as follows:

Model hardening

refers to techniques resulting in a new classifier with better robustness properties than the original one with respect to some given metrics.

Data preprocessing

techniques achieve higher robustness by using transformations of the classifier inputs and labels, respectively, at test and/or training time.

Runtime detection

of adversarial samples by extending the original classifier with a detector in order to check whether a given input is adversarial or not.

The following defences are currently implemented in ART:

  • art.defences.AdversarialTrainer: adversarial training, a method for model hardening (see Section 6.1).

  • art.defences.FeatureSqueezing: feature squeezing, a defence based on input data preprocessing (see Section 6.3)

  • art.defences.LabelSmoothing: label smoothing, a defence based on processing the labels before model training (described in Section 6.4).

  • art.defences.SpatialSmoothing: spatial smoothing, a defence based on input data preprocessing (see Section 6.5).

  • art.defences.GaussianAugmentation: data augmentation based on Gaussian noise, meant to reinforce the structure of the model (described in Section 6.6).

6.1 Adversarial Training

art.defences.AdversarialTrainer art/defences/

The idea of adversarial training [11] is to improve the robustness of the classifier by including adversarial samples in the training set. A special case of adversarial training is virtual adversarial training [23] where the adversarial samples are generated by the Virtual Adversarial Method (see Section 5.9).

The ART implementation of adversarial training incorporates the original protocol, as well as ensemble adversarial training [33], training fully on adversarial data and other common setups. If multiple attacks are specified, they are rotated for each batch. If the specified attacks target a different model than the one being trained, the attack is transferred. A ratio parameter determines how many of the clean samples in each batch are replaced with their adversarial counterparts. When the attack targets the current classifier, only successful adversarial samples are used.
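The effect of the ratio parameter on a single batch can be illustrated with a small, self-contained sketch (our own illustration, not the ART implementation):

```python
import numpy as np

def mix_batch(x_batch, x_adv_batch, ratio, rng):
    """Replace a `ratio` fraction of the clean batch with adversarial counterparts."""
    n = len(x_batch)
    n_adv = int(ratio * n)
    # choose which samples of the batch are swapped for their adversarial versions
    idx = rng.choice(n, size=n_adv, replace=False)
    out = x_batch.copy()
    out[idx] = x_adv_batch[idx]
    return out
```

With ratio=1, every sample in the batch is replaced, i.e. training is performed fully on adversarial data.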

Implementation details

The AdversarialTrainer class has the following public functions:

  • __init__(classifier, attacks, ratio=.5)

    The parameters of the constructor are the following:

    • classifier: The classifier to be hardened by adversarial training.

    • attacks: Attack or list of attacks to be used for adversarial training. These are instances of the ‘Attack‘ class.

    • ratio: The proportion of samples in each batch to be replaced with their adversarial counterparts. Setting this value to 1 allows to train only on adversarial samples.

    Each instance of the class Attack has an attribute classifier containing the classifier that the attack aims at.

  • fit(x, y, batch_size=128, nb_epochs=20) -> None

    Method for training the hardened classifier. x and y are the original training inputs and labels, respectively. The method applies the adversarial attacks specified in the attacks parameter of __init__ to obtain the enhanced training set as explained above. Then the fit function of the classifier specified in the class attribute classifier is called with the enhanced training set as input. The hardened classifier is trained with the given batch size for as many epochs as specified. After calling this function, the classifier attribute of the class contains the hardened classifier.

  • predict(x, **kwargs) -> np.ndarray

    Calls the predict function of the hardened classifier, passing on the dictionary of arguments kwargs.

6.2 The Data Preprocessor Base class

art.defences.Preprocessor art/defences/

The abstract class Preprocessor provides a unified interface for all data preprocessing transformations to be used as defences. The class has the following public methods:

  • fit(x, y=None, **kwargs): Fit the transformations with the given data and parameters (where applicable).

  • __call__(x, y=None): Applies the transformations to the provided labels and/or inputs and returns the transformed data.

  • is_fitted: Check whether the transformations have been fitted (where applicable).

Feature squeezing, label smoothing, spatial smoothing and Gaussian data augmentation are all data preprocessing techniques. Their implementation is based on the Preprocessor abstract class. We describe them individually in the following sections.

6.3 Feature Squeezing

art.defences.FeatureSqueezing art/defences/

Feature squeezing [36] reduces the precision of the components of x by encoding them on a smaller number of bits. In the case of images, one can think of feature squeezing as reducing the common 8-bit pixel values to b bits where b < 8. Formally, assuming that the components of x all lie within [0, 1] (which would have been obtained, e.g., by dividing 8-bit pixel values by 255) and given the desired bit depth b, feature squeezing applies the following transformation component-wise:

x_squeezed = round(x · (2^b − 1)) / (2^b − 1).
Future implementations of this method will support arbitrary data ranges of the components of x.
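The squeezing transformation itself is a one-liner; a minimal NumPy sketch (our own, not the ART implementation) for components in [0, 1]:

```python
import numpy as np

def squeeze(x, bit_depth):
    """Reduce each component of x (in [0, 1]) to `bit_depth` bits of precision."""
    max_val = 2 ** bit_depth - 1
    # round onto the grid of 2^b representable values, then rescale back to [0, 1]
    return np.round(x * max_val) / max_val
```

With bit_depth=1, every component collapses to either 0 or 1, which removes small adversarial perturbations along with the fine-grained signal.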

Implementation details

The FeatureSqueezing class is an extension of the Preprocessor base class. Since feature squeezing does not require any model fitting for the data transformations, is_fitted always returns True, and the fit function does not have any effect. We now describe the functions and parameters specific to this class:

  • __init__(bit_depth=8)

    • bit_depth: The bit depth on which to encode features.

  • __call__(x, y=None, bit_depth=None) -> np.ndarray

    As feature squeezing only works on the inputs x, providing labels y to this function does not have any effect. The bit_depth parameter has the same meaning as in the constructor. Calling this method returns the squeezed inputs x.

Note that any instance of the Classifier class which has feature squeezing as one of its defences will automatically apply this operation when the fit or predict functions are called. The next example shows how to activate this defence when creating a classifier object from a Keras model:

from keras.models import load_model

model = load_model('keras_model.h5')
classifier = KerasClassifier((0, 1), model, defences='featsqueeze5')

Applying feature squeezing to data without using a classifier would be done as follows:

import numpy as np
from art.defences import FeatureSqueezing

x = np.random.rand(50, 20)
feat_squeeze = FeatureSqueezing()
x = feat_squeeze(x, bit_depth=1)

6.4 Label Smoothing

art.defences.LabelSmoothing art/defences/

Label smoothing [34] modifies the labels during model training: instead of using one-hot encoding, where the label y is represented by the standard basis vector e_y, the following representation is used:

z_k = y_max if k = y, and z_k = (1 − y_max) / (K − 1) otherwise,

where y_max is a parameter specified by the user and K is the number of classes. The label representation is thus “smoothed” in the sense that the difference between its maximum and minimum components is reduced and its entropy increased. The motivation behind this approach is that it might help reducing the gradients that an adversary could exploit in the construction of adversarial samples.
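A hedged NumPy sketch of this label transformation (our own illustration, not the ART implementation):

```python
import numpy as np

def smooth_labels(y_one_hot, max_value=0.9):
    """Smooth one-hot labels: the true class gets max_value, the remaining
    probability mass is spread uniformly over the other classes."""
    k = y_one_hot.shape[1]
    min_value = (1.0 - max_value) / (k - 1)
    return np.where(y_one_hot == 1, max_value, min_value)
```

Each smoothed row still sums to 1, so it remains a valid probability distribution over the classes.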

Implementation details

As with feature squeezing, label smoothing does not require any model fitting, hence is_fitted is always True, and the fit function is a dummy. The functions specific to the LabelSmoothing class are:

  • __init__(max_value=.9)

    • max_value: The value assigned to the correct class (the remaining probability mass is distributed uniformly among the other classes).

  • __call__(x, y, max_value=0.9) -> (np.ndarray, np.ndarray)

    This function expects one-hot encoded labels in y and returns the unmodified inputs x along with the smoothed labels. max_value can be specified directly when calling the method instead of the constructor.

Note that any instance of the Classifier class which has label smoothing activated in its defences will automatically apply it when the fit function is called. One can initialize this defence when creating a classifier as follows:

from keras.models import load_model

model = load_model('keras_model.h5')
classifier = KerasClassifier((0, 1), model, defences='label_smooth')

Label smoothing can be used directly to preprocess the dataset as follows:

import numpy as np
from art.defences import LabelSmoothing

x = np.random.rand(50, 20)
y = np.eye(10)[np.random.randint(0, 10, 50)]
label_smooth = LabelSmoothing()
x, y = label_smooth(x, y, max_value=.8)

6.5 Spatial Smoothing

art.defences.SpatialSmoothing art/defences/

Spatial smoothing [36] is a defence specifically designed for images. It attempts to filter out adversarial signals using local spatial smoothing. Let x_{i,j,c} denote the components of x, where i indexes width, j height and c the color channel. Given a window size w, each component x_{i,j,c} is replaced by the median of the components in the w × w window centred at position (i, j) within channel c:

x̃_{i,j,c} = median{ x_{i+k, j+l, c} : −⌊w/2⌋ ≤ k, l ≤ ⌊w/2⌋ },

where indices falling outside the image are reflected at the borders. Note that the local spatial smoothing is applied separately in each color channel c.
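A minimal NumPy sketch of the median filter with reflected borders for a single channel (our own illustration, not the ART implementation):

```python
import numpy as np

def median_smooth(channel, window_size=3):
    """Median-filter a single 2-D channel with reflection padding at the borders."""
    pad = window_size // 2
    # reflect the image at its borders so every window is fully defined
    padded = np.pad(channel, pad, mode='reflect')
    h, w = channel.shape
    out = np.empty_like(channel, dtype=float)
    for i in range(h):
        for j in range(w):
            # replace each pixel by the median of its window
            out[i, j] = np.median(padded[i:i + window_size, j:j + window_size])
    return out
```

A single-pixel perturbation (such as a one-pixel attack) is completely removed by a 3 × 3 median filter, which illustrates why this defence targets sparse adversarial noise.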

Implementation details

As for the previous two defences, is_fitted is always True, and the fit function does not have any effect. The signatures of the functions in the SpatialSmoothing class are:

  • __init__(window_size=3)

    • window_size: The window size w.

  • __call__(self, x, y=None, window_size=None) -> np.ndarray

    This method only requires inputs x and returns them with spatial smoothing applied.

Note that any instance of the Classifier class which has spatial smoothing as one of its defences will automatically apply it when the predict function is called. The next example shows how to initialize a classifier with spatial smoothing:

from keras.models import load_model

model = load_model('keras_model.h5')
classifier = KerasClassifier((0, 1), model, defences='smooth')

Spatial smoothing can be used directly to preprocess the dataset as follows:

import numpy as np
from art.defences import SpatialSmoothing

x = np.random.rand(50, 20)
smooth = SpatialSmoothing()
x = smooth(x, window_size=5)

6.6 Gaussian Data Augmentation

art.defences.GaussianAugmentation art/defences/

Gaussian data augmentation [37] is a standard data augmentation technique in computer vision that has also been used to improve the robustness of a model to adversarial attacks. This method augments a dataset with copies of the original samples to which Gaussian noise has been added. An important advantage of this defence is its independence from the attack strategy. Its usage is mostly intended for the augmentation of the training set.
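The augmentation itself can be sketched in a few lines of NumPy (our own illustration, not the ART implementation; the sampling strategy for which originals to copy is an assumption):

```python
import numpy as np

def gaussian_augment(x, y, sigma=1.0, ratio=1.0, seed=0):
    """Append `ratio * len(x)` noisy copies of randomly chosen samples."""
    rng = np.random.default_rng(seed)
    n_aug = int(ratio * len(x))
    idx = rng.integers(0, len(x), size=n_aug)
    # noisy copies keep the labels of the originals they were drawn from
    noisy = x[idx] + rng.normal(scale=sigma, size=x[idx].shape)
    return np.concatenate([x, noisy]), np.concatenate([y, y[idx]])
```

With ratio=1, the returned dataset has twice the original size, matching the behaviour described for GaussianAugmentation below.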

Implementation details

The GaussianAugmentation class extends the Preprocessor abstract class. Applying this method does not require training, thus the fit function is a dummy. The functions supported by this class are:

  • __init__(sigma=1., ratio=1.), where:

    • sigma

      : The standard deviation of the Gaussian noise to be added to the samples.

    • ratio: The augmentation ratio, i.e. how many augmented samples to generate for each original sample. A ratio of 1 will double the size of the dataset.

  • __call__(x, y=None, sigma=None, ratio=None) -> (np.ndarray, np.ndarray)

    Notice that the labels y are not required, as the method does not apply any preprocessing to them. If provided, they will be returned as the second object in the tuple; otherwise, a single object containing the augmented version of x is returned. The other parameters are the same as for the constructor.

The following example shows how to use GaussianAugmentation. Notice that the size of the dataset doubles after processing: aug_x contains all the original samples and their Gaussian-augmented copies. The order of the labels in aug_y corresponds to that of aug_x.

from keras.datasets import mnist
from art.defences import GaussianAugmentation

(x_train, y_train), (_, _) = mnist.load_data()

ga = GaussianAugmentation(sigma=.3)
aug_x, aug_y = ga(x_train, y_train)
print(x_train.shape, y_train.shape, aug_x.shape, aug_y.shape)

7 Evasion Detection


This module serves the purpose of providing runtime detection methods for adversarial samples. Currently, the library implements two types of detectors:

  • art.detection.BinaryInputDetector: a detector based on opposing clean and adversarial data at the input level (see Section 7.2).

  • art.detection.BinaryActivationDetector: a detector built based on the activation values produced by a neural network classifier at a given internal layer (see Section 7.3).

Both methods implement the API provided in the Detector base class, which we describe in the following.

7.1 The Detector Base Class

art.detection.Detector art/detection/

The Detector abstract class provides a unified interface for all runtime adversarial detection methods. The class has the following public methods:

  • fit(x, y=None, **kwargs) -> None

    Fit the detector with the given data and parameters (where applicable).

  • __call__(x) -> np.ndarray

    Applies detection to the provided inputs and returns binary decisions for each input sample.

  • is_fitted

    Check whether the detector has been fitted (where applicable).

7.2 Binary Detector Based on Inputs

art.detection.BinaryInputDetector art/detection/

This method builds a binary classifier whose labels represent whether a given input is adversarial (label 1) or not (label 0). The detector is fit with a mix of clean and adversarial data; following this step, it is ready to detect adversarial inputs.
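Constructing the detector's training set amounts to stacking clean and adversarial samples with binary labels; a minimal sketch (our own illustration, not the ART implementation):

```python
import numpy as np

def build_detector_data(x_clean, x_adv):
    """Stack clean and adversarial inputs with binary labels (0 clean, 1 adversarial)."""
    x = np.concatenate([x_clean, x_adv])
    y = np.concatenate([np.zeros(len(x_clean)), np.ones(len(x_adv))])
    return x, y
```

The resulting (x, y) pair would then be passed to the fit function of the detector's underlying classifier.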

Implementation details

The BinaryInputDetector implements the Detector abstract class. Its constructor takes the following parameters:

  • __init__(detector), where detector is a Classifier object and represents the architecture that will be trained for detection.

7.3 Binary Detector Based on Activations

art.poison_detection.BinaryActivationDetector art/detection/

This method builds a binary classifier whose labels represent whether a given input is adversarial (label 1) or not (label 0). The activations detector differs from the previous one in that it uses as training inputs the activation values of a different classifier. Only the activations from one specified layer are used.

Implementation details

The BinaryActivationDetector implements the Detector abstract class. The method is initialized with the following parameters:

  • __init__(classifier, detector, layer), where:

    • classifier is a trained Classifier object; its activation values will be used for training the detector.

    • detector is a Classifier object and represents the architecture that will be trained for detection.

    • layer represents the layer index or layer name to be used for computing the activations of classifier.

8 Poisoning Detection


Data used to train machine learning models are often collected from potentially untrustworthy sources. This is particularly true for crowdsourced data (e.g. Amazon Mechanical Turk), social media data, and data collected from user behavior (e.g. customer satisfaction ratings, purchasing history, user traffic). Adversaries can craft inputs to modify the decision boundaries of machine learning models to misclassify inputs or to reduce model performance.

As part of targeted misclassification attacks, recent work has shown that adversaries can generate “backdoors” or “trojans” into machine learning models by inserting malicious data into the training set [12]. The resulting model performs very well on the intended training and testing data, but behaves badly on specific attacker-chosen inputs. As an example, it has been demonstrated that a network trained to identify street signs has strong performance on standard inputs, but identifies stop signs as speed limit signs when a special sticker is added to the stop sign. This backdoor provides adversaries a method for ensuring that any stop sign is misclassified simply by placing a sticker on it. Unlike adversarial samples that require specific, complex noise to be added to an image, backdoor triggers can be quite simple and can be easily applied to an image - as easily as adding a sticker to a sign or modifying a pixel.

ART provides filtering defences to detect this type of poisoning attack.

8.1 The PoisonFilteringDefence Base Class

art.poison_detection.PoisonFilteringDefence art/poison_detection/

The abstract class PoisonFilteringDefence defines an interface for the detection of poison when the training data is available. This class takes a model and its corresponding training data and identifies the set of data points that are suspected of being poisonous. The functions supported by this class are:

  • __init__(self, classifier, x_train, y_train, verbose=True), where:

    • classifier: corresponds to the trained model evaluated for poison

    • x_train: is the training data (features) used to train ‘classifier‘.

    • y_train: corresponds to the set of labels associated with x_train.

    • verbose: bool, ‘True‘ prints information about the analysis, by default it is set to ‘True‘.

  • detect_poison(self, **kwargs) -> ‘list‘
    Here, kwargs are defence-specific parameters used by child classes. This function returns a ‘list‘ with items identified as poison.

  • evaluate_defence(self, is_clean, **kwargs) -> JSON object

    If ground truth is known, this function returns a confusion matrix in the form of a JSON object. The parameters of this function are:

    • is_clean: 1-D array where is_clean[i]=1 means x_train[i] is clean and is_clean[i]=0 that it is poisonous.

    • kwargs: are defence-specific parameters used by child classes.

8.2 Poisoning Filter Using Activation Clustering Against Backdoor Attacks

art.poison_detection.ActivationDefence art/poison_detection/

The Activation Clustering defence detects poisonous data crafted to insert backdoors into neural networks. For this purpose, the defence takes the data used to train the provided classifier and analyses the differences in how the neural network decides on the classification of each data point in the training set. Concretely, each sample in the training set is classified and the activations of the last hidden layer are retained. These activations are segmented according to their labels. For each activation segment, the dimensionality is reduced, and then a clustering algorithm is applied. Experimental results suggest that poisonous and legitimate data separate into distinct clusters, akin to the way in which different areas of the brain light up on scans when subjected to different stimuli. Each resulting cluster is then analyzed for poison according to different heuristics, for example cluster size or cluster cohesiveness. Analysts can also manually review the results. After poisonous data is identified, the model needs to be repaired accordingly.
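The reduce-cluster-flag pipeline can be sketched end-to-end on toy data with plain NumPy (our own simplified illustration using PCA via SVD and 2-means with the "smaller cluster" heuristic; this is not the ART implementation, which supports further reduction methods and heuristics):

```python
import numpy as np

def two_means(z, iters=20):
    """Plain 2-means clustering (Lloyd's algorithm) on the rows of z."""
    centers = z[[0, -1]].astype(float)          # initialize with two distinct points
    for _ in range(iters):
        d = np.linalg.norm(z[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)               # assign each row to its closest center
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = z[labels == k].mean(axis=0)
    return labels

def flag_poison(activations, ndims=2):
    """Reduce activations with PCA (via SVD), cluster into two groups,
    and flag the smaller cluster as suspected poison."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:ndims].T           # project onto top principal components
    labels = two_means(reduced)
    smaller = 0 if (labels == 0).sum() < (labels == 1).sum() else 1
    return labels == smaller
```

On activations where a small group of samples lies far from the bulk, the smaller cluster is flagged as the suspected poison.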

Implementation details

The ActivationDefence class extends the PoisonFilteringDefence abstract class. Hence, it uses its constructor:

  • __init__(self, classifier, x_train, y_train, verbose=True), where:

    • classifier: corresponds to the trained model evaluated for poison

    • x_train: is the training data (features) used to train ‘classifier‘.

    • y_train: corresponds to the set of labels associated with x_train.

    • verbose: bool, ‘True‘ prints information about the analysis, by default it is set to ‘True‘.

  • detect_poison(self, **kwargs) -> list
    The activation defence allows setting the following parameters:

    • clustering_method: the clustering algorithm to be used. Currently, KMeans is the only supported method.

    • n_clusters: the number of clusters to find. This value needs to be greater than or equal to one.

    • reduce: the method used to reduce the dimensionality of the activations. Supported methods include PCA, FastICA and TSNE.

    • ndims: the number of dimensions to retain after reduction.

    • cluster_analysis: the heuristic used to automatically determine whether a cluster contains poisonous data. Supported methods include smaller and distance. The smaller heuristic flags the cluster with fewer data points as poisonous, while the distance heuristic uses the distance between the clusters.

    When an ActivationDefence is instantiated, the following defaults are used: the clustering method is KMeans with two clusters, the dimensionality reduction technique is PCA, and the cluster analysis heuristic is smaller, which flags the smaller cluster as poisonous.

  • evaluate_defence(self, is_clean) -> JSON object
    If ground truth is known, this function returns a confusion matrix in the form of a JSON object. Here, is_clean is a 1-D array where is_clean[i]=1 means x_train[i] is clean and is_clean[i]=0 that it is poisonous.

The following example shows how to use ActivationDefence.

In [1]: from keras.models import load_model
   ...: from art.classifiers import KerasClassifier
   ...: from art.poison_detection import ActivationDefence
   ...: from art.utils import load_mnist
   ...:
   ...: (x_train, y_train), (_, _), _, _ = load_mnist()
   ...: model = load_model('keras_model.h5')
   ...: classifier = KerasClassifier((0, 1), model=model)
   ...: defence = ActivationDefence(classifier, x_train, y_train)
   ...: confidence_level, is_clean_lst = defence.detect_poison(n_clusters=2, ndims=10, reduce="PCA")
   ...: confusion_matrix = defence.evaluate_defence(is_clean_lst)

Finally, it is possible to take the clusters generated by the defence and inspect them manually.

9 Metrics


For assessing the robustness of a classifier against adversarial attacks, possible metrics include: the average minimal perturbation required to get an input misclassified; the average sensitivity of the model’s loss function with respect to changes in the inputs; or the average sensitivity of the model’s logits with respect to changes in the inputs, e.g. based on Lipschitz constants in the neighborhood of sample inputs. The art.metrics module implements several metrics to assess the robustness (or, conversely, the vulnerability) of a given classifier, either generally or with respect to specific attacks:

  • Empirical robustness (see Section 9.1)

  • Loss sensitivity (see Section 9.2)

  • CLEVER score (see Section 9.3).

9.1 Empirical Robustness

art.metrics.empirical_robustness art/

Empirical robustness assesses the robustness of a given classifier with respect to a specific attack and test data set. It is equivalent to the average minimal perturbation that the attacker needs to introduce for a successful attack, as introduced in [24]. Given a trained classifier f, an untargeted attack ρ and test data samples X = (x_1, ..., x_n), let I be the subset of indices i for which f(ρ(x_i)) ≠ f(x_i), i.e. for which the attack was successful. Then the empirical robustness (ER) is defined as:

ER(f, ρ, X) = (1/|I|) · Σ_{i ∈ I} ‖ρ(x_i) − x_i‖_p / ‖x_i‖_p.

Here p is the norm used in the creation of the adversarial samples (if applicable); the default value is p = 2.
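As a numeric illustration of this definition (not ART's code), the following sketch computes ER directly from arrays of original and adversarial samples, assuming a boolean mask of successful attacks is given.

```python
# Sketch of the empirical robustness formula: the average relative
# perturbation norm over the successfully attacked inputs.
import numpy as np

def empirical_robustness_from_samples(x, x_adv, success, p=2):
    """ER = mean over successful attacks of ||x_adv - x||_p / ||x||_p."""
    idx = np.where(success)[0]
    if idx.size == 0:
        return 0.0
    pert = np.linalg.norm((x_adv - x)[idx], ord=p, axis=1)
    orig = np.linalg.norm(x[idx], ord=p, axis=1)
    return float(np.mean(pert / orig))

x = np.array([[3.0, 4.0], [0.0, 2.0]])      # ||x_0|| = 5, ||x_1|| = 2
x_adv = np.array([[3.0, 5.0], [0.0, 2.5]])  # perturbation norms 1 and 0.5
print(empirical_robustness_from_samples(x, x_adv, success=[True, True]))
# (1/5 + 0.5/2) / 2 = 0.225
```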

Implementation details

The method for calculating the empirical robustness has the following signature:
empirical_robustness(classifier, x, attack_name, attack_params=None) -> float.

It returns the empirical robustness value as defined above. The input parameters are as follows:

  • classifier: The classifier to be assessed.

  • x: An np.ndarray with the test data X.

  • attack_name: A string specifying the attack to be used. Currently, the only supported attack is ‘fgsm’ (Fast Gradient Sign Method, see Section 5.2).

  • attack_params: A dictionary with attack-specific parameters. If the attack has a norm attribute, then this is used as the p in the calculation above; otherwise, the standard Euclidean distance is used (p = 2).

9.2 Loss Sensitivity

art.metrics.loss_sensitivity art/

Local loss sensitivity aims to quantify the smoothness of a model by estimating its Lipschitz continuity constant, which measures the largest variation of a function under a small change in its input: the smaller the value, the smoother the function. This measure is estimated based on the gradients of the classifier loss, as considered e.g. in [17]. It is thus an attack-independent measure offering insight into the properties of the model.

Given a classifier f with loss function ℓ, a test set X = (x_1, ..., x_n) and labels Y = (y_1, ..., y_n), the loss sensitivity (LS) is defined as follows:

LS(f, X, Y) = (1/n) · Σ_{i=1}^{n} ‖∇_x ℓ(x_i, y_i)‖_2.
Implementation details

The method for calculating the loss sensitivity has the following signature and returns the loss sensitivity as defined above:
loss_sensitivity(classifier, x, y) -> float, where:

  • classifier is the classifier to be assessed.

  • x is an np.ndarray with the test data X.

  • y is an np.ndarray with the one-hot encoded labels of x.
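As an illustration of the quantity being computed (not the library's implementation), the following toy evaluates the mean L2 norm of the loss gradient for a logistic-regression model, whose input gradient of the cross-entropy loss has the closed form (σ(w·x) − y)·w.

```python
# Toy loss sensitivity: mean L2 norm of the input gradient of the loss.
import numpy as np

def loss_sensitivity(w, x, y):
    preds = 1.0 / (1.0 + np.exp(-x @ w))       # sigmoid scores
    grads = (preds - y)[:, None] * w[None, :]  # d(cross-entropy)/dx per sample
    return float(np.mean(np.linalg.norm(grads, axis=1)))

w = np.array([1.0, -2.0])
x = np.array([[0.5, 0.5], [1.0, 0.0]])
y = np.array([1.0, 0.0])
print(loss_sensitivity(w, x, y))  # smaller values indicate a smoother model
```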

9.3 Clever

art.metrics.clever_u art.metrics.clever_t art/

The Cross Lipschitz Extreme Value for nEtwork Robustness metric (CLEVER) [35] estimates, for a given input x and norm p, a lower bound γ for the minimal perturbation that is required to change the classification of x, i.e. ‖δ‖_p < γ implies f(x + δ) = f(x) (see Corollary 3.2.2 in [35]). The derivation of γ is based on a Lipschitz constant for the gradients of the classifier’s logits in a p-norm ball with radius R around x. Since in general there is no closed-form expression or upper bound for this constant (closed-form expressions for special types of classifiers are derived in [14], which is the first work considering Lipschitz bounds for deriving formal guarantees of classifier robustness against adversarial inputs), the CLEVER algorithm uses an estimate based on extreme value theory.

Algorithm 8 outlines the calculation of the CLEVER metric for targeted attacks: given a target class t, the score γ is constructed such that f(x + δ) ≠ t as long as ‖δ‖_p ≤ γ. Below we discuss how to adapt the algorithm to the untargeted case. A key step in the algorithm is outlined in lines 5-6, where the norm of gradient differences is evaluated at points randomly sampled from the uniform distribution on the p-norm ball with radius R around x. The set S collects the maximum norms from N_b batches of size N_s each (line 8), and μ̂ is the MLE of the location parameter of a reverse Weibull distribution given the realizations in S. The final CLEVER score is calculated in line 11. Note that it is bounded by the radius R of the ball, as for any larger perturbations the estimated Lipschitz constant might not apply.

The CLEVER score for untargeted attacks is obtained by simply taking the minimum CLEVER score for targeted attacks over all classes t with t ≠ f(x).

Algorithm 8: CLEVER score for targeted attack

Input: x: input for which to calculate the CLEVER score; t: target class; N_b: number of batches over each of which the maximum gradient norm is computed; N_s: number of samples per batch; p: perturbation norm; R: maximum norm of perturbations considered in the CLEVER score.

1:  c ← argmax_j f_j(x); S ← ∅; q ← p/(p − 1)
2:  g(·) ← ∇(f_c − f_t)(·)
3:  for i = 1 to N_b do
4:     for j = 1 to N_s do
5:        sample x_j uniformly from the p-norm ball with radius R around x
6:        b_j ← ‖g(x_j)‖_q
7:     end for
8:     S ← S ∪ {max_j b_j}
9:  end for
10: μ̂ ← MLE of the location parameter of a reverse Weibull distribution fitted to S
11: return CLEVER score γ = min((f_c(x) − f_t(x)) / μ̂, R)

Implementation details

The method for calculating the CLEVER score for targeted attacks has the following signature:
clever_t(classifier, x, target_class, nb_batches, batch_size, radius, norm,
c_init=1, pool_factor=10) -> float

Consider its parameters:

  • classifier, x and target_class are respectively the classifier f, the input x and the target class t for which the CLEVER score is to be calculated.

  • nb_batches and batch_size are the number of batches N_b and the batch size N_s, respectively.

  • radius is the radius R of the p-norm ball.

  • norm is the perturbation norm p.

  • c_init is a hyper-parameter for the MLE computation in line 10 (which uses the weibull_min function from scipy.stats).

  • pool_factor is an integer factor determining the size of the pool of random samples from the p-norm ball with radius R around x that is created and (re-)used in lines 5-6 for computational efficiency purposes.

The method clever_u for calculating the CLEVER score of untargeted attacks has the same signature as clever_t except that it doesn’t take the target_class parameter.
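The structure of the estimation loop can be sketched as follows. This is a schematic toy, not ART's implementation: the reverse-Weibull location MLE of line 10 is replaced by the plain maximum of the batch maxima, sampling is shown for the L2 ball only, and the gradient-norm function is passed in as a black box.

```python
# Schematic sketch of the CLEVER estimation loop (targeted case).
import numpy as np

def clever_t_sketch(grad_norm_fn, x, delta_f, nb_batches, batch_size,
                    radius, seed=0):
    rng = np.random.default_rng(seed)
    batch_maxima = []
    for _ in range(nb_batches):
        # sample points uniformly from an L2 ball of the given radius around x
        d = rng.normal(size=(batch_size, x.size))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        d *= radius * rng.uniform(size=(batch_size, 1)) ** (1.0 / x.size)
        norms = [grad_norm_fn(x + di) for di in d]
        batch_maxima.append(max(norms))
    # stand-in for the reverse-Weibull location MLE of line 10
    lipschitz_est = max(batch_maxima)
    # line 11: score bounded by the ball radius
    return min(delta_f / lipschitz_est, radius)

# Toy setting: gradient norm of f_c - f_t is constant 2, margin delta_f = 0.5
score = clever_t_sketch(lambda z: 2.0, np.zeros(3), delta_f=0.5,
                        nb_batches=5, batch_size=10, radius=1.0)
print(score)  # 0.5 / 2 = 0.25
```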

10 Versioning

The library uses semantic versioning, meaning that version numbers take the form of MAJOR.MINOR.PATCH. Given such a version number, we increment the
  • MAJOR version when we make incompatible API changes,

  • MINOR version when we add functionality in a backwards-compatible manner, and

  • PATCH version when we make backwards-compatible bug fixes.

Consistent benchmark results can be obtained with ART under constant MAJOR.MINOR versions. Please report these when publishing experiments.
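For illustration, the compatibility rule above can be expressed as a tiny (hypothetical) helper that compares the MAJOR.MINOR prefix of two version strings:

```python
# Illustrative helper (not part of ART): benchmark results are comparable
# when two versions share the same MAJOR.MINOR prefix.
def comparable_benchmarks(v1, v2):
    major1, minor1, _ = v1.split(".")
    major2, minor2, _ = v2.split(".")
    return (major1, minor1) == (major2, minor2)

print(comparable_benchmarks("0.2.1", "0.2.2"))  # True: only the PATCH differs
print(comparable_benchmarks("0.2.2", "0.3.0"))  # False: MINOR changed
```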


Acknowledgements

We would like to thank the following colleagues (in alphabetical order) for their contributions, advice, feedback and support: Vijay Arya, Pin-Yu Chen, Evelyn Duesterwald, David Kung, Taesung Lee, Sameep Mehta, Anupama Murthi, Biplav Srivastava, Deepak Vijaykeerthy, Jialong Zhang, Vladimir Zolotov.