An adversarial example library for constructing attacks, building defenses, and benchmarking both
cleverhans is a software library that provides standardized reference implementations of adversarial example construction techniques and adversarial training. The library may be used to develop more robust machine learning models and to provide standardized benchmarks of models' performance in the adversarial setting. Benchmarks constructed without a standardized implementation of adversarial example construction are not comparable to each other, because a good result may indicate a robust model or it may merely indicate a weak implementation of the adversarial example construction procedure. This technical report is structured as follows. Section <ref> provides an overview of adversarial examples in machine learning and of the cleverhans software. Section <ref> presents the core functionalities of the library: namely the attacks based on adversarial examples and defenses to improve the robustness of machine learning models to these attacks. Section <ref> describes how to report benchmark results using the library. Section <ref> describes the versioning system.
Adversarial examples are inputs crafted by making slight perturbations to legitimate inputs with the intent of misleading machine learning models. The perturbations are designed to be small in magnitude, such that a human observer would not have difficulty processing the resulting input. In many cases, the perturbation required to deceive a machine learning model is so small that a human being may not be able to perceive that anything has changed, or even so small that an 8-bit representation of the input values does not capture the perturbation used to fool a model that accepts 32-bit inputs. We refer readers unfamiliar with the concept to the detailed presentations in [18, 11, 17, 4]. Although completely effective defenses have yet to be proposed, the most successful to date is adversarial training [18, 11]. Different sources of adversarial examples used in the training process can make adversarial training more effective; as of this writing, to the best of our knowledge, the most effective version of adversarial training on ImageNet is ensemble adversarial training and the most effective version on MNIST is the basic iterative method applied to randomly chosen starting points.
The cleverhans library provides reference implementations of the attacks, which serve two purposes. First, machine learning developers may construct robust models by using adversarial training, which requires the construction of adversarial examples during the training procedure. Second, we encourage researchers who report the accuracy of their models in the adversarial setting to use the standardized reference implementation provided by cleverhans. Without a standard reference implementation, different benchmarks are not comparable: a benchmark reporting high accuracy might indicate a more robust model, but it might also indicate the use of a weaker attack implementation. By using cleverhans, researchers can be assured that a high accuracy on a benchmark corresponds to a robust model rather than to a weak attack implementation.
Implemented in TensorFlow, cleverhans is designed as a tool to help developers add defenses against adversarial examples to their models and benchmark the robustness of their models to adversarial examples. The interface for cleverhans is designed to accept models implemented using any model framework (such as Keras) or implemented without any specific model abstraction.
The library is a collaborative project and is free, open-source software, licensed under the MIT license. The project is available online through GitHub (https://github.com/openai/cleverhans). The main communication channel for developers of the library is a mailing list, whose discussions are publicly available online (https://groups.google.com/group/cleverhans-dev).
The library’s package is organized by modules. The most important modules are:
attacks: contains the Attack class, defining the interface used by all CleverHans attacks, as well as implementations of several specific attacks.
model: contains the Model class, which is a very lightweight class defining a simple interface that models should implement in order to be compatible with Attack. CleverHans includes a Model implementation for Keras Sequential models and examples of Model implementations for TensorFlow models that are not implemented using any modeling framework library.
In the following, we describe some of the research results behind the implementations made in cleverhans.
Adversarial example crafting algorithms implemented in cleverhans take a model and an input, and return the corresponding adversarial example. The algorithms currently implemented in the attacks module are described below.
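The L-BFGS method was introduced by Szegedy et al. Given an input $x$ and a target class $l$, it aims to solve the following box-constrained optimization problem:
$$\min_{x^*} \; \|x^* - x\|_2 \quad \text{subject to} \quad f(x^*) = l, \quad x^* \in [0, 1]^n$$
The computation is approximated by using box-constrained L-BFGS optimization.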
The fast gradient sign method (FGSM) was introduced by Goodfellow et al. The intuition behind the attack is to linearize the cost function used to train a model around the neighborhood of the training point that the adversary wants to have misclassified. The resulting adversarial example $x^*$ corresponding to input $x$ is computed as follows:
$$x^* = x + \varepsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$$
where $J$ is the cost function used to train the model $f$ with parameters $\theta$, $y$ is the label of $x$, and $\varepsilon$ is a parameter controlling the magnitude of the perturbation introduced. Larger values increase the likelihood that $x^*$ will be misclassified by $f$, but make the perturbation easier to detect by a human.
The fast gradient sign method is available by calling attacks.fgsm(). The implementation defines the necessary graph elements and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. The implementation is parameterized by the perturbation magnitude $\varepsilon$ introduced above. It is possible to configure the method to clip adversarial examples so that they are constrained to be part of the expected input domain range.
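To make the update rule concrete, the following is a minimal sketch of the fast gradient sign step on a toy logistic-regression model, written in plain Python rather than TensorFlow. It is not the cleverhans implementation: the model, the helper names (`logistic_loss_grad`, `fgsm`) and the clipping arguments are assumptions chosen for illustration.

```python
import math

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def logistic_loss_grad(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x for a
    toy model p(y=1|x) = sigmoid(w.x + b), with label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps, clip_min=None, clip_max=None):
    """One signed-gradient step of size eps, optionally clipped to the
    expected input range (hypothetical argument names)."""
    grad = logistic_loss_grad(w, b, x, y)
    x_adv = [xi + eps * sign(gi) for xi, gi in zip(x, grad)]
    if clip_min is not None and clip_max is not None:
        x_adv = [min(max(v, clip_min), clip_max) for v in x_adv]
    return x_adv

w, b = [2.0, -1.0], 0.0
x, y = [0.6, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.3, clip_min=0.0, clip_max=1.0)
# x_adv is approximately [0.3, 0.5]: every feature moved by exactly eps.
```

Each feature moves by exactly eps in the direction that increases the loss, which is why eps directly bounds the maximum per-feature change.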
The Carlini-Wagner (C&W) attack was introduced by Carlini et al. Inspired by prior work, the authors formulate finding adversarial examples as an optimization problem: find some small change that can be made to an input that will change its classification, such that the result is still in the valid range. They instantiate the distance metric with an $L_p$ norm, define a success function $f$ such that $f(x^*) \le 0$ if and only if the model misclassifies, and minimize the sum of the distance and the success function weighted by a trade-off constant ‘c’. The constant ‘c’ is chosen by modified binary search, the box constraint is resolved by applying a change of variables, and the Adam optimizer is used to solve the optimization instance.
The attack has been shown to be quite powerful [5, 6]; however, this power comes at the cost of speed, as this attack is often much slower than others. The attack can be sped up by fixing ‘c’ (instead of performing modified binary search).
The Carlini-Wagner attack is available by instantiating the attack object with attacks.CarliniWagnerL2 and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. As the name suggests, the norm used in the implementation is $L_2$. The attack is controlled by a number of parameters: the confidence, which defines the margin between logit values necessary to succeed; the learning rate (step size); the number of binary search steps; the number of iterations per binary search step; and the initial ‘c’ value.
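The objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch of a targeted variant, not the cleverhans code: the function names, the `toy_logits` model and the exact form of the success term are assumptions. The tanh change of variables keeps the candidate inside [0, 1] without an explicit box constraint.

```python
import math

def to_valid_range(w):
    """Change of variables: tanh maps any real w_i into the range [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]

def cw_objective(w, x, logits_fn, target, c, confidence=0.0):
    """Squared L2 distance plus c times a success term that is <= 0 once the
    target logit is at least as large as every other logit."""
    x_adv = to_valid_range(w)
    distance = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    logits = logits_fn(x_adv)
    best_other = max(z for i, z in enumerate(logits) if i != target)
    success = max(best_other - logits[target], -confidence)
    return distance + c * success

def toy_logits(x):
    """Hypothetical two-class model whose logits are the two input features."""
    return [x[0], x[1]]

x = [0.8, 0.2]
w = [math.atanh(2.0 * xi - 1.0) for xi in x]  # start the search at x itself
value = cw_objective(w, x, toy_logits, target=1, c=2.0)
```

An optimizer such as Adam would then minimize this value over w; a larger ‘c’ favors success over small distortion, which is the trade-off the binary search tunes.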
The Elastic Net Method (EAD) was introduced by Chen et al. As in the C&W attack, finding adversarial examples is formulated as an optimization problem. The same loss function as used by the C&W attack is adopted; however, instead of performing $L_2$ regularization, elastic-net regularization is performed, with $\beta$ controlling the trade-off between $L_1$ and $L_2$. The iterative shrinkage-thresholding algorithm (ISTA) is used to solve the resulting problem. ISTA can be viewed as a regular first-order optimization algorithm with an additional shrinkage-thresholding step on each iteration.
Notably, the C&W attack becomes a special case of the EAD formulation, with $\beta = 0$. However, one can view EAD as a robust version of the C&W method, as the ISTA operation shrinks a value of the adversarial example if the deviation from the original input is greater than $\beta$, and leaves the value unchanged if the deviation is less than $\beta$. Empirical results support this claim, demonstrating the attack’s ability to bypass strong detection schemes and succeed against robust adversarially trained models while still producing adversarial examples with minimal visual distortion [7, 22, 21, 13].
The Elastic Net Method is available by instantiating the attack object with attacks.ElasticNetMethod and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. The attack is controlled by a number of parameters, most of which are shared with the C&W attack: the confidence, which defines the margin between logit values necessary to succeed; the learning rate (step size); the number of binary search steps; the number of iterations per binary search step; and the initial ‘c’ value. Additional parameters include $\beta$, the elastic-net regularization constant, and the decision rule, which selects whether successful adversarial examples are chosen by minimal $L_1$ or minimal elastic-net distortion.
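The shrinkage-thresholding step that distinguishes ISTA from plain gradient descent can be sketched as follows. This is an illustrative reading of the operator, not the cleverhans code: the function name and the [0, 1] box are assumptions. It also assumes that a feature whose deviation from the original input is at most beta is reset to the original value, so that feature carries no perturbation.

```python
def shrink_threshold(z, x0, beta, lo=0.0, hi=1.0):
    """Element-wise projected shrinkage-thresholding around the original
    input x0 (hypothetical helper)."""
    out = []
    for zi, xi in zip(z, x0):
        deviation = zi - xi
        if deviation > beta:
            out.append(min(zi - beta, hi))   # shrink towards x0 from above
        elif deviation < -beta:
            out.append(max(zi + beta, lo))   # shrink towards x0 from below
        else:
            out.append(xi)                   # small deviation: no perturbation
    return out

candidate = [0.9, 0.55, 0.1]
original = [0.5, 0.5, 0.5]
shrunk = shrink_threshold(candidate, original, beta=0.1)
# shrunk is approximately [0.8, 0.5, 0.2]
```

Because small deviations are zeroed out, the resulting perturbation tends to be sparse, which is the effect of the $L_1$ part of the elastic-net penalty.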
The basic iterative method (BIM) was introduced by Kurakin et al., and extends the “fast” gradient method by applying it multiple times with a small step size, clipping values of intermediate results after each step to ensure that they are in an $\varepsilon$-neighborhood of the original input.
The basic iterative method is available by instantiating the attack object with attacks.BasicIterativeMethod and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. The attack is parameterized by $\varepsilon$, like the fast gradient method, but also by the step size for each attack iteration and the number of attack iterations.
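The iterate-then-clip loop can be sketched in plain Python as below. The sketch is framework-free and is not the cleverhans implementation: `grad_fn` stands in for the gradient of the model's loss with respect to the input, and the argument names are illustrative assumptions.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def basic_iterative_method(grad_fn, x, eps, eps_iter, nb_iter,
                           clip_min=0.0, clip_max=1.0):
    """Repeat small signed-gradient steps, clipping after every step so the
    iterate stays within eps of x and inside the valid input range."""
    x_adv = list(x)
    for _ in range(nb_iter):
        grad = grad_fn(x_adv)
        x_adv = [a + eps_iter * sign(g) for a, g in zip(x_adv, grad)]
        # Stay in the eps-neighborhood of the original input.
        x_adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]
        # Stay in the expected input domain.
        x_adv = [min(max(a, clip_min), clip_max) for a in x_adv]
    return x_adv

# Toy loss whose gradient always points in the direction (+, -).
x_adv = basic_iterative_method(lambda _: [1.0, -1.0], [0.5, 0.5],
                               eps=0.1, eps_iter=0.04, nb_iter=5)
# x_adv is approximately [0.6, 0.4]: the third step hits the eps boundary.
```

With five steps of 0.04 the unclipped iterate would move by 0.2, so the clipping step is what enforces the overall budget of 0.1.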
The projected gradient descent (PGD) attack was introduced by Madry et al. The authors state that the basic iterative method (BIM) is essentially projected gradient descent on the negative loss function. To explore the loss landscape further, PGD is restarted from many points in the $L_\infty$ balls around the input examples.
PGD is available by instantiating the attack object with attacks.MadryEtAl and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. PGD shares many parameters with BIM, such as $\varepsilon$, the step size for each attack iteration, and the number of attack iterations. An additional parameter is a boolean which specifies whether or not to add an initial random perturbation.
The momentum iterative method (MIM) was introduced by Dong et al. It builds on momentum, a technique for accelerating gradient descent algorithms by accumulating a velocity vector in the gradient direction of the loss function across iterations. BIM with incorporated momentum applied to an ensemble of models won first place in both the NIPS 2017 Non-Targeted and Targeted Adversarial Attack Competitions.
The momentum iterative method is available by instantiating the attack object with attacks.MomentumIterativeMethod and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. MIM shares many parameters with BIM, such as $\varepsilon$, the step size for each attack iteration, and the number of attack iterations. An additional parameter is a decay factor which can be applied to the momentum term.
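The effect of the velocity vector can be sketched as follows. This is a plain-Python illustration under assumptions, not the cleverhans implementation: `grad_fn` is a stand-in for the loss gradient, the gradient is normalized by its L1 norm before accumulation, and the argument names are illustrative.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def momentum_iterative_method(grad_fn, x, eps, eps_iter, nb_iter,
                              decay_factor=1.0):
    """Iterative signed steps that follow an accumulated velocity instead of
    the raw gradient of the current iterate."""
    x_adv = list(x)
    velocity = [0.0] * len(x)
    for _ in range(nb_iter):
        grad = grad_fn(x_adv)
        l1 = sum(abs(g) for g in grad) or 1.0  # avoid dividing by zero
        velocity = [decay_factor * v + g / l1 for v, g in zip(velocity, grad)]
        x_adv = [a + eps_iter * sign(v) for a, v in zip(x_adv, velocity)]
        # Keep the iterate within eps of the original input.
        x_adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]
    return x_adv

# The first coordinate's gradient flips sign on the second iteration.
grads = iter([[3.0, 1.0], [-1.0, 1.0]])
x_adv = momentum_iterative_method(lambda _: next(grads), [0.5, 0.5],
                                  eps=0.3, eps_iter=0.1, nb_iter=2)
# x_adv is approximately [0.7, 0.7]
```

Without momentum the first coordinate would step forward and then back, ending at 0.5; the accumulated velocity keeps it moving in the earlier, stronger direction.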
The Jacobian-based saliency map approach (JSMA) was introduced by Papernot et al. The method iteratively perturbs features of the input that have large adversarial saliency scores. Intuitively, this score reflects the adversarial goal of taking a sample away from its source class towards a chosen target class.
First, the adversary computes the Jacobian of the model and evaluates it at the current input: this returns a matrix where component $(i, j)$ is the derivative $\partial f_j / \partial x_i$ of class $j$ with respect to input feature $i$. To compute the adversarial saliency map, the adversary then computes the following for each input feature $i$:
$$S(x, t)[i] = \begin{cases} 0 & \text{if } \dfrac{\partial f_t(x)}{\partial x_i} < 0 \text{ or } \sum_{j \neq t} \dfrac{\partial f_j(x)}{\partial x_i} > 0 \\ \dfrac{\partial f_t(x)}{\partial x_i} \left| \sum_{j \neq t} \dfrac{\partial f_j(x)}{\partial x_i} \right| & \text{otherwise} \end{cases}$$
where $t$ is the target class that the adversary wants the machine learning model to assign. The adversary then selects the input feature with the largest saliency score and increases its value (in the original paper and the cleverhans implementation, input features are selected by pairs using the same heuristic). The process is repeated until misclassification in the target class is achieved or the maximum number of perturbed features has been reached.
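The per-feature saliency score can be sketched in plain Python. The sketch takes an already computed Jacobian as a nested list and scores single features; the function name and the toy Jacobian values are assumptions for illustration, and the pairwise selection mentioned above is not modeled.

```python
def saliency_map(jacobian, target):
    """Adversarial saliency score for every input feature.

    jacobian[i][j] is the derivative of class j with respect to feature i."""
    scores = []
    for row in jacobian:
        d_target = row[target]
        d_others = sum(d for j, d in enumerate(row) if j != target)
        if d_target < 0 or d_others > 0:
            # Increasing this feature would hurt the target class or help
            # the other classes, so it is not a useful candidate.
            scores.append(0.0)
        else:
            scores.append(d_target * abs(d_others))
    return scores

# Toy Jacobian: 3 input features (rows) and 3 classes (columns).
jacobian = [
    [0.5, -0.3, -0.1],
    [0.2, 0.4, -0.1],
    [-0.1, 0.2, 0.3],
]
scores = saliency_map(jacobian, target=0)
best_feature = max(range(len(scores)), key=scores.__getitem__)
# Only feature 0 gets a non-zero score, so best_feature is 0.
```

A feature scores highly only when increasing it both raises the target class output and lowers the combined output of all other classes.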
In cleverhans, the Jacobian-based saliency map approach may be called with attacks.jsma(). The implementation returns the adversarial example directly, as well as whether the target class was achieved or not, and how many input features were perturbed.
DeepFool was introduced by Moosavi-Dezfooli et al. Unlike most of the attacks described here, it cannot be used in the targeted case, where the attacker specifies what target class the model should classify the adversarial example as. It can only be used in the non-targeted case, where the attacker can only ensure that the model classifies the adversarial example in a class different from the original.
Inspired by the fact that the corresponding separating hyperplanes in linear classifiers indicate the decision boundaries of each class, DeepFool aims to find the least distortion (in terms of Euclidean distance) leading to misclassification by projecting the input example to the closest separating hyperplane. An approximate iterative algorithm is proposed for attacking neural networks in order to tackle their inherent nonlinearities.
DeepFool is available by instantiating the attack object with attacks.DeepFool and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. DeepFool has a few parameters, such as the number of classes to test against, a termination criterion to prevent vanishing updates, and the maximum number of iterations.
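For a linear binary classifier the projection has a closed form, which the following plain-Python sketch illustrates. It is not the cleverhans implementation: the function names and the small `overshoot` factor used to step just past the boundary are assumptions. For a neural network, a step of this kind would be repeated on a local linearization of the model.

```python
import math

def decision_value(w, b, x):
    """Signed score of a linear binary classifier f(x) = w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def deepfool_linear(w, b, x, overshoot=0.02):
    """Move x along the normal of the hyperplane w.x + b = 0 by the minimal
    Euclidean distance, slightly overshooting so the predicted sign flips."""
    f = decision_value(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    r = [-(f / norm_sq) * wi for wi in w]  # orthogonal projection step
    return [xi + (1.0 + overshoot) * ri for xi, ri in zip(x, r)]

w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
x_adv = deepfool_linear(w, b, x)
distance = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_adv, x)))
# decision_value changes sign, and distance is approximately 1.02 * 4.0
```

The exact projection has length |f(x)| / ||w||, here 20 / 5 = 4, so no smaller Euclidean perturbation can cross this decision boundary.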
Feature Adversaries were introduced by Sabour et al. Instead of solely considering adversaries which disrupt classification, termed label adversaries, the authors considered adversarial examples which are confused with other examples not just in class label, but in their internal representations as well. Such examples are generated by feature adversaries.
Such feature adversarial examples are generated by minimizing the Euclidean distance between the internal deep representations (at a specified layer) while constraining the distance between the input and adversarial example in terms of $L_\infty$ to be less than $\varepsilon$. The optimization is conducted using box-constrained L-BFGS.
Feature adversaries are available by instantiating the attack object with attacks.FastFeatureAdversaries and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. The implementation is parameterized by the following set of parameters: $\varepsilon$, the step size for each attack iteration, the number of attack iterations, and the layer to target.
Simultaneous perturbation stochastic approximation (SPSA) was introduced by Uesato et al. SPSA is a gradient-free optimization method, which is useful when the model is non-differentiable or, more generally, when the gradients do not point in useful directions. Gradients are approximated using finite difference estimates in random directions.
SPSA is available by instantiating the attack object with attacks.SPSA and then calling the generate() function. This generates the symbolic graph and returns a tensor, which once evaluated holds the value of the adversarial example corresponding to the input provided. The implementation is parameterized by the following set of parameters: $\varepsilon$, the number of optimization steps, the learning rate (step size), and the perturbation size used for the finite difference approximation.
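The finite difference estimate in random directions can be sketched as follows. This is a plain-Python illustration, not the cleverhans implementation: the function name, the use of random ±1 directions, and the averaging over `num_samples` draws are assumptions. Only loss values are queried, never gradients.

```python
import random

def spsa_gradient(loss_fn, x, delta, num_samples, rng):
    """Estimate the gradient of loss_fn at x from two loss evaluations per
    random direction, averaged over num_samples directions."""
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        plus = loss_fn([xi + delta * vi for xi, vi in zip(x, v)])
        minus = loss_fn([xi - delta * vi for xi, vi in zip(x, v)])
        diff = (plus - minus) / (2.0 * delta)
        # For entries in {-1, +1}, dividing by v_i equals multiplying by v_i.
        grad = [g + diff * vi / num_samples for g, vi in zip(grad, v)]
    return grad

rng = random.Random(0)
# One-dimensional linear loss: the estimate recovers the slope of 3.
grad_1d = spsa_gradient(lambda x: 3.0 * x[0], [0.5], delta=0.01,
                        num_samples=4, rng=rng)
```

In more than one dimension a single draw is noisy because all coordinates are perturbed at once, so the estimate is averaged over several directions before it is used in an optimization step.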
The intuition behind defenses against adversarial examples is to make the model smoother by limiting its sensitivity to small perturbations of its inputs (and therefore making adversarial examples harder to craft). Since all defenses currently proposed modify the learning algorithm used to train the model, we implement them in the modules of cleverhans that contain the functions used to train models. In module utils_tf, the following defenses are implemented.
The intuition behind adversarial training [18, 11] is to inject adversarial examples during training to improve the generalization of the machine learning model. To achieve this effect, the training function tf_model_train() implemented in module utils_tf can be given the tensor definition for an adversarial example: e.g., the one returned by the method described in Section 2.1.2. When such a tensor is given, the training algorithm modifies the loss function used to optimize the model parameters: it is in that case defined as the average between the loss for predictions on legitimate inputs and the loss for predictions made on adversarial examples. The remainder of the training algorithm is left unchanged.
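The modified loss can be sketched independently of any framework. In the sketch below, `predict_fn` and `attack_fn` are hypothetical stand-ins for the model and for an attack such as the fast gradient sign method; this is not the signature of tf_model_train(). The loss is the plain average of the legitimate and adversarial terms.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[label])

def adversarial_training_loss(predict_fn, attack_fn, x, label):
    """Average of the loss on the legitimate input and the loss on the
    adversarial example crafted from it."""
    clean_loss = cross_entropy(predict_fn(x), label)
    adv_loss = cross_entropy(predict_fn(attack_fn(x, label)), label)
    return 0.5 * (clean_loss + adv_loss)

# Toy one-feature model and a toy attack that shifts the feature by 0.25.
def predict(x):
    return [x[0], 1.0 - x[0]]

def attack(x, label):
    return [x[0] - 0.25]

loss = adversarial_training_loss(predict, attack, [0.75], label=0)
```

Because the adversarial term is recomputed from the current model at every training step, the attack tracks the model as its parameters change.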
This section provides instructions for how to prepare and report benchmark results.
When comparing against previously published benchmarks, it is best to use the same version of cleverhans as was used to produce the previous benchmarks. This minimizes the possibility that an undetected change in behavior between versions could cause a difference in the output of the benchmark results.
When reporting new results that are not directly compared to previous work, it is best to use the most recent versioned release of cleverhans.
In all cases, it is important to report the version number of cleverhans.
In addition to this information, one should also report which attack methods were used, and the values of any configuration parameters used for these attacks.
For example, you might report “We benchmarked the robustness of our method to adversarial attack using v2.1.0 of CleverHans (Papernot et al. 2018). On a test set modified by fgsm with eps of 0.3, we obtained a test set accuracy of 97.9%.”
The library does not provide specific test datasets or data preprocessing. End users are responsible for appropriately preparing the data in their specific application areas, and for reporting sufficient information about the data preprocessing and model family to make benchmarks appropriately comparable.
Because one of the goals of cleverhans is to provide a basis for reproducible benchmarks, it is important that the version numbers provide useful information. The library uses semantic versioning (http://semver.org/), meaning that version numbers take the form of MAJOR.MINOR.PATCH.
The PATCH number increments whenever backwards-compatible bug fixes are made. For the purpose of this library, a bug fix is not considered backwards-compatible if it changes the results of a benchmark test. The MINOR number increments whenever new features are added in a backwards-compatible manner. The MAJOR number increments whenever an interface changes.
Any time a bug in CleverHans affects the accuracy of any performance number reported as a benchmark result, we consider fixing the bug to constitute an API change (to the interface mapping from the specification of a benchmark experiment to the reported performance) and increment the MAJOR version number when we make the next release. For this reason, when writing academic articles, it is important to compare CleverHans benchmark results that were produced with the same MAJOR version number. Release notes accompanying each revision indicate whether an increment to the MAJOR number invalidates earlier benchmark results or not.
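The MAJOR-version rule can be checked mechanically when comparing reported results. The helper below is a hypothetical sketch, not part of cleverhans; it assumes version strings of the form MAJOR.MINOR.PATCH with an optional leading "v".

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string (optionally prefixed with 'v')
    into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch

def benchmarks_comparable(version_a, version_b):
    """Two benchmark results are treated as comparable only when they were
    produced with the same MAJOR version."""
    return parse_version(version_a)[0] == parse_version(version_b)[0]

same_major = benchmarks_comparable("v2.1.0", "2.0.3")
different_major = benchmarks_comparable("2.1.0", "3.0.0")
```

Here `same_major` is True and `different_major` is False; in the second case the release notes should be consulted before comparing numbers.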
Release notes for each version are available at https://github.com/tensorflow/cleverhans/releases
The format of this report was in part inspired by . Nicolas Papernot is supported by a Google PhD Fellowship in Security. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.