My exploration of Adversarial Examples in Neural Networks
Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
Large neural networks, recast as deep neural networks (DNNs) in the mid 2000s, altered the machine learning landscape by outperforming other approaches in many tasks. This was made possible by advances that reduced the computational complexity of training. For instance, deep learning (DL) can now take advantage of large datasets to achieve accuracy rates higher than previous classification techniques. In short, DL is transforming computational processing of complex data in many domains such as vision [24, 37], speech recognition [15, 32, 33], language processing, financial fraud detection, and recently malware detection.
This increasing use of deep learning is creating incentives for adversaries to manipulate DNNs to force misclassification of inputs. For instance, applications of deep learning use image classifiers to distinguish inappropriate from appropriate content, and text and image classifiers to differentiate between spam and non-spam email. An adversary able to craft misclassified inputs would profit from evading detection; indeed, such attacks occur today on non-DL classification systems [6, 7, 22]. In the physical domain, consider a driverless car system that uses DL to identify traffic signs. If slightly altering “STOP” signs causes DNNs to misclassify them, the car would not stop, thus subverting the car’s safety.
An adversarial sample is an input crafted to cause deep learning algorithms to misclassify. Note that adversarial samples are created at test time, after the DNN has been trained by the defender, and do not require any alteration of the training process. Figure 1 shows examples of adversarial samples taken from our validation experiments. It shows how an image originally showing a digit can be altered to force a DNN to classify it as another digit. Adversarial samples are created from benign samples by adding distortions exploiting the imperfect generalization learned by DNNs from finite training sets, and the underlying linearity of most components used to build DNNs. Previous work explored DNN properties that could be used to craft adversarial samples [18, 30, 36]. Simply put, these techniques exploit gradients computed by network training algorithms: instead of using these gradients to update network parameters as would normally be done, gradients are used to update the original input itself, which is subsequently misclassified by DNNs.
In this paper, we describe a new class of algorithms for adversarial sample creation against any feedforward (acyclic) DNN and formalize the threat model space of deep learning with respect to the integrity of output classification. Unlike previous approaches mentioned above, we compute a direct mapping from the input to the output to achieve an explicit adversarial goal. Furthermore, our approach only alters a (frequently small) fraction of input features, leading to reduced perturbation of the source inputs. It also enables adversaries to apply heuristic searches to find perturbations leading to input targeted misclassifications (perturbing inputs to result in a specific output classification).
More formally, a DNN models a multidimensional function $F: X \mapsto Y$ where $X$ is a (raw) feature vector and $Y$ is an output vector. We construct an adversarial sample $X^*$ from a benign sample $X$ by adding a perturbation vector $\delta_X$ solving the following optimization problem:
$$\arg\min_{\delta_X} \|\delta_X\| \quad \text{s.t.} \quad F(X + \delta_X) = Y^* \tag{1}$$
where $X^* = X + \delta_X$ is the adversarial sample, $Y^*$ is the desired adversarial output, and $\|\cdot\|$ is a norm appropriate to compare the DNN inputs. Solving this problem is non-trivial, as properties of DNNs make it non-linear and non-convex. Thus, we craft adversarial samples by constructing a mapping from input perturbations to output variations. Note that all research mentioned above took the opposite approach: it used output variations to find corresponding input perturbations. Our understanding of how changes made to inputs affect a DNN’s output stems from the evaluation of the forward derivative: a matrix we introduce and define as the Jacobian of the function learned by the DNN. The forward derivative is used to construct adversarial saliency maps indicating input features to include in perturbation $\delta_X$ in order to produce adversarial samples inducing a certain behavior from the DNN.
Forward derivative approaches are much more powerful than gradient descent techniques used in prior systems. They are applicable to both supervised and unsupervised architectures and allow adversaries to generate information for broad families of adversarial samples. Indeed, adversarial saliency maps are versatile tools based on the forward derivative and designed with adversarial goals in mind, giving greater control to adversaries with respect to the choice of perturbations. In our work, we consider the following questions to formalize the security of DL in adversarial settings: (1) “What is the minimal knowledge required to perform attacks against DL?”, (2) “How can vulnerable or resistant samples be identified?”, and (3) “How are adversarial samples perceived by humans?”.
The adversarial sample generation algorithms are validated using the widely studied LeNet architecture (a pioneering DNN used for hand-written digit recognition) and the MNIST dataset. We show that any input sample can be perturbed to be misclassified as any target class with 97% success while perturbing on average 4.02% of the input features per sample. The computational costs of the sample generation are modest; samples were each generated in less than a second in our setup. Lastly, we study the impact of our algorithmic parameters on distortion and human perception of samples. This paper makes the following contributions:
We formalize the space of adversaries against classification DNNs with respect to adversarial goal and capabilities. Here, we provide a better understanding of how attacker capabilities constrain attack strategies and goals.
We introduce a new class of algorithms for crafting adversarial samples solely by using knowledge of the DNN architecture. These algorithms (1) exploit forward derivatives that inform the learned behavior of DNNs, and (2) build adversarial saliency maps enabling an efficient exploration of the adversarial-samples search space.
We validate the algorithms using a widely used computer vision DNN. We define and measure sample distortion and source-to-target hardness, and explore defenses against adversarial samples. We conclude by studying human perception of distorted samples.
Classical threat models enumerate the goals and capabilities of adversaries in a target domain. This section taxonomizes threat models in deep learning systems and positions several previous works with respect to the strength of the modeled adversary. We begin by providing an overview of deep neural networks highlighting their inputs, outputs and function. We then consider the taxonomy presented in Figure 2.
Deep neural networks are large neural networks organized into layers of neurons, corresponding to successive representations of the input data. A neuron is an individual computing unit transmitting to other neurons the result of the application of its activation function on its input. Neurons are connected by links with different weights and biases characterizing the strength between neuron pairs. Weights and biases can be viewed as DNN parameters used for information storage. We define a network architecture $F$ to include knowledge of the network topology, neuron activation functions, as well as weight and bias values. Weights and biases are determined during training by finding values that minimize a cost function evaluated over the training data. Network training is traditionally done by gradient descent using backpropagation.
Deep learning can be partitioned in two categories, depending on whether DNNs are trained in a supervised or unsupervised manner. Supervised training leads to models that map unseen samples using a function inferred from labeled training data. On the contrary, unsupervised training learns representations of unlabeled training data, and resulting DNNs can be used to generate new samples, or to automate feature engineering by acting as a pre-processing layer for larger DNNs. We restrict ourselves to the problem of learning multi-class classifiers in supervised settings. These DNNs are given an input $X$ and output a class probability vector $Y$. Note that our work remains valid for unsupervised-trained DNNs; we leave a detailed study of this issue for future work.
Figure 3 illustrates an example shallow feedforward neural network (a shallow neural network is a small neural network that operates, albeit at a smaller scale, identically to the DL networks considered throughout). The network has two input neurons $x_1$ and $x_2$, a hidden layer with two neurons $h_1$ and $h_2$, and a single output neuron $o$. In other words, it is a simple multi-layer perceptron. Both input neurons $x_1$ and $x_2$ take real values in $[0, 1]$ and correspond to the network input: a feature vector $X = (x_1, x_2) \in [0, 1]^2$. Hidden layer neurons each use the logistic sigmoid function $\phi: x \mapsto \frac{1}{1 + e^{-x}}$ as their activation function. This function is frequently used in neural networks because it is continuous (and differentiable), demonstrates linear-like behavior around $0$, and saturates as the input goes to $\pm\infty$. Neurons in the hidden layers apply the sigmoid to the weighted input layer: for instance, neuron $h_1$ computes $h_1(X) = \phi(z_{h_1}(X))$ with $z_{h_1}(X) = w_{11} x_1 + w_{12} x_2 + b_1$ where $w_{11}$ and $w_{12}$ are weights and $b_1$ a bias. Similarly, the output neuron applies the sigmoid function to the weighted output of the hidden layer $z_o(X) = w_{31} h_1(X) + w_{32} h_2(X) + b_3$, where $w_{31}$ and $w_{32}$ are weights and $b_3$ a bias. Weight and bias values are determined during training. Thus, the overall behavior of the network learned during training can be modeled as a function: $F: X \mapsto \phi(z_o(X))$.
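As an illustration only, the toy network described above can be sketched in Python with NumPy. The function names and the weight values below are hypothetical placeholders chosen for the example; they are not the trained values of the network in Figure 3.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation used by every neuron of the toy network."""
    return 1.0 / (1.0 + np.exp(-z))

def toy_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 2-2-1 multi-layer perceptron.

    x:        input vector (x1, x2), shape (2,)
    W_hidden: hidden-layer weights, shape (2, 2); row p feeds hidden neuron p
    b_hidden: hidden-layer biases, shape (2,)
    w_out:    output-neuron weights, shape (2,)
    b_out:    output-neuron bias, scalar
    """
    h = sigmoid(W_hidden @ x + b_hidden)   # hidden activations h1, h2
    return sigmoid(w_out @ h + b_out)      # network output F(X)

# Hypothetical (untrained) parameters, for illustration only.
W_hidden = np.array([[1.0, 1.0], [-1.0, 2.0]])
b_hidden = np.zeros(2)
w_out = np.array([2.0, -1.0])
b_out = 0.0

y = toy_forward(np.array([0.5, 0.5]), W_hidden, b_hidden, w_out, b_out)
```

Because the output neuron is a sigmoid, `y` always lies strictly between 0 and 1, which is why the outputs are rounded when the network is compared with a Boolean function.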
Threats are defined with a specific function to be protected/defended. In the case of deep learning systems, the integrity of the classification is of paramount importance. Specifically, an adversary of a deep learning system seeks to provide an input that results in an incorrect output classification. The nature of the incorrectness represents the adversarial goal, as identified in the X-axis of Figure 2. Consider four goals that impact classifier output integrity:
Confidence reduction - reduce the output confidence classification (thereby introducing class ambiguity)
Misclassification - alter the output classification to any class different from the original class
Targeted misclassification - produce inputs that force the output classification to be a specific target class. Continuing the example illustrated in Figure 1, the adversary would create a set of speckles classified as a digit.
Source/target misclassification - force the output classification of a specific input to be a specific target class. Continuing the example from Figure 1, adversaries take an existing image of a digit and add a small number of speckles to classify the resulting image as another digit.
The scientific community recently started exploring adversarial deep learning. Previous work on other machine learning techniques is referenced later in Section VII.
Szegedy et al. introduced a system that generates adversarial samples by perturbing inputs in a way that creates source/target misclassifications. The perturbations made by their work, which focused on a computer vision application, are not distinguishable by humans: for example, small but carefully-crafted perturbations to an image of a vehicle resulted in the DNN classifying it as an ostrich. The authors named this modified input an adversarial image, which can be generalized as part of a broader definition of adversarial samples. When producing adversarial samples, the adversary’s goal is to generate inputs that are correctly classified (or not distinguishable) by humans or other classifiers, but are misclassified by the targeted DNN.
Another example is due to Nguyen et al., who presented a method for producing images that are unrecognizable to humans, but are nonetheless labeled as recognizable objects by DNNs. For instance, they demonstrated how a DNN will classify a noise-filled image constructed using their technique as a television with high confidence. They named the images produced by this method fooling images. Here, a fooling image is one that does not have a source class but is crafted solely to perform a targeted misclassification attack.
Adversaries are defined by the information and capabilities at their disposal. The following (and the Y-axis of Figure 2) describes a range of adversaries loosely organized by decreasing adversarial strength (and increasing attack difficulty). Note that we only consider attacks conducted at test time; any tampering with the training procedure is outside the scope of this paper.
Training data and network architecture - This adversary has perfect knowledge of the neural network used for classification. The attacker has access to the training data, functions and algorithms used for network training, and is able to extract knowledge about the DNN’s architecture $F$. This includes the number and type of layers, the activation functions of neurons, as well as weight and bias matrices. He also knows which algorithm was used to train the network, including the associated loss function. This is the strongest adversary that can analyze the training data and simulate the deep neural network in toto.
Network architecture - This adversary has knowledge of the network architecture and its parameter values. For instance, this corresponds to an adversary who can collect information about both (1) the layers and activation functions used to design the neural network, and (2) the weights and biases resulting from the training phase. This gives the adversary enough information to simulate the network. Our algorithms assume this threat model, and show a new class of algorithms that generate adversarial samples for supervised and unsupervised feedforward DNNs.
Training data - This adversary is able to collect a surrogate dataset, sampled from the same distribution as the original dataset used to train the DNN. However, the attacker is not aware of the architecture used to design the neural network. Thus, typical attacks conducted in this model would likely include training commonly deployed deep learning architectures using the surrogate dataset to approximate the model learned by the legitimate classifier.
Oracle - This adversary has the ability to use the neural network (or a proxy of it) as an “oracle”. Here the adversary can obtain output classifications from supplied inputs (much like a chosen-plaintext attack in cryptography). This enables differential attacks, where the adversary can observe the relationship between changes in inputs and outputs (continuing with the analogy, such as used in differential cryptanalysis) to adaptively craft adversarial samples. This adversary can be further parameterized by the number of absolute or rate-limited input/output trials they may perform.
Samples - This adversary has the ability to collect pairs of input and output related to the neural network classifier. However, he cannot modify these inputs to observe the difference in the output. To continue the cryptanalysis analogy, this threat model would correspond to a known plaintext attack. These pairs are largely labeled output data, and intuition states that they would most likely only be useful in very large quantities.
In this section, we present a general algorithm for modifying samples so that a DNN yields any adversarial output. We later validate this algorithm by having a classifier misclassify samples into a chosen target class. This algorithm captures adversaries crafting samples in the setting corresponding to the upper right-hand corner of Figure 2. We show that knowledge of the architecture and weight parameters is sufficient to derive adversarial samples against acyclic feedforward DNNs (that is, the algorithm does not require knowledge of the dataset used to train the DNN; instead, it exploits knowledge of trained parameters). This requires evaluating the DNN’s forward derivative in order to construct an adversarial saliency map that identifies the set of input features relevant to the adversary’s goal. Perturbing the features identified in this way quickly leads to the desired adversarial output, for instance misclassification. Although we describe our approach with supervised neural networks used as classifiers, it also applies to unsupervised architectures.
Recall the simple architecture introduced previously in Section II and illustrated in Figure 3. Its low dimensionality allows us to better understand the underlying concepts behind our algorithms. We indeed show how small input perturbations found using the forward derivative can induce large variations of the neural network output. Assuming that input biases $b_1$, $b_2$, and $b_3$ are null, we train this toy network to learn the AND function: the desired output is $F(X) = x_1 \wedge x_2$ with $X = (x_1, x_2)$. Note that non-integer inputs are rounded to the closest integer, thus we have for instance $0.7 \wedge 0.3 = 0$ or $0.8 \wedge 0.6 = 1$. Using backpropagation on a set of 1,000 samples, corresponding to each case of the function ($1 \wedge 1 = 1$, $1 \wedge 0 = 0$, $0 \wedge 1 = 0$, and $0 \wedge 0 = 0$), we train for 100 epochs using a fixed learning rate. The overall function learned by the neural network is plotted on Figure 4 for input values $X \in [0, 1]^2$. The horizontal axes represent the 2 input dimensions $x_1$ and $x_2$ while the vertical axis represents the network output $F(X)$ corresponding to $X = (x_1, x_2)$.
We are now going to demonstrate how to craft adversarial samples on this neural network. The adversary considers a legitimate sample $X$, classified as $F(X) = Y$ by the network, and wants to craft an adversarial sample $X^*$ very similar to $X$, but misclassified as $F(X^*) = Y^* \neq Y$. Recall that we formalized this problem as:
$$\arg\min_{\delta_X} \|\delta_X\| \quad \text{s.t.} \quad F(X + \delta_X) = Y^* \tag{2}$$
where $X^* = X + \delta_X$ is the adversarial sample, $Y^*$ is the desired adversarial output, and $\|\cdot\|$ is a norm appropriate to compare points in the input domain. Informally, the adversary is searching for small perturbations of the input that will incur a modification of the output into $Y^*$. Finding these perturbations can be done using optimization techniques, simple heuristics, or even brute force. However, such solutions are hard to implement for deep neural networks because of non-convexity and non-linearity. Instead, we propose a systematic approach stemming from the forward derivative.
We define the forward derivative as the Jacobian matrix of the function $F$ learned by the neural network during training. For this example, the output of $F$ is one dimensional, the matrix is therefore reduced to a vector:
$$\nabla F(X) = \left[ \frac{\partial F(X)}{\partial x_1}, \frac{\partial F(X)}{\partial x_2} \right]$$
Both components of this vector are computable using the adversary’s knowledge, and later we show how to compute this term efficiently. The forward derivative for our example network is illustrated in Figure 5, which plots the gradient for the second component $\frac{\partial F(X)}{\partial x_2}$ on the vertical axis against $x_1$ and $x_2$ on the horizontal axes. We omit the plot for $\frac{\partial F(X)}{\partial x_1}$ because $F$ is approximately symmetric on its two inputs, making the first component redundant for our purposes. This plot makes it easy to visualize the divide between the network’s two possible outputs in terms of values assigned to the input feature $x_2$: 0 to the left of the spike, and 1 to its right. Notice that this aligns with Figure 4, and gives us the information needed to achieve our adversarial goal: find input perturbations that drive the output closer to a desired value.
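To make the two-component forward derivative concrete, the following sketch computes it analytically for a 2-2-1 sigmoid network and checks it against central finite differences. This is a minimal illustration under our own assumptions: the parameter values are hypothetical (not the trained AND network), and the helper names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, v, c):
    """Scalar output F(X) of a 2-2-1 sigmoid network."""
    h = sigmoid(W @ x + b)
    return sigmoid(v @ h + c)

def forward_derivative(x, W, b, v, c):
    """Analytic gradient [dF/dx1, dF/dx2] of the output w.r.t. the input."""
    h = sigmoid(W @ x + b)
    out = sigmoid(v @ h + c)
    # sigmoid'(z) = s(z) * (1 - s(z)); row p of dh_dx holds dh_p/dx
    dh_dx = (h * (1.0 - h))[:, None] * W
    return out * (1.0 - out) * (v @ dh_dx)

def numeric_derivative(x, W, b, v, c, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (forward(x + step, W, b, v, c)
                   - forward(x - step, W, b, v, c)) / (2.0 * eps)
    return grad

# Hypothetical parameters, for illustration only.
W = np.array([[4.0, 3.0], [-2.0, 5.0]])
b = np.zeros(2)
v = np.array([3.0, -1.5])
c = 0.0
x = np.array([0.9, 0.4])

analytic = forward_derivative(x, W, b, v, c)
numeric = numeric_derivative(x, W, b, v, c)
```

Evaluating `forward_derivative` on a grid of inputs in $[0, 1]^2$ and plotting its second component would produce a surface of the kind shown in Figure 5 for whatever parameters are supplied.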
Consulting Figure 5 alongside our example network, we can confirm this intuition by looking at a few sample points. Consider two inputs $X$ and $X^*$ that are both located near the spike in Figure 5 and differ only by a small amount $\delta_{x_2}$ in their second component. Despite this small difference, they cause a significant change in the network’s output: $F(X)$ is close to 0 whereas $F(X^*)$ is close to 1. Recalling that we round the inputs and outputs of this network so that it agrees with the Boolean AND function, we see that $X^*$ is an adversarial sample: after rounding, $X$ and $X^*$ denote the same Boolean input, yet $F(X^*)$ rounds to 1 while $F(X)$ rounds to 0. Just as importantly, the forward derivative tells us which input regions are unlikely to yield adversarial samples, and are thus more immune to adversarial manipulations. Notice in Figure 5 that when either input is close to 0, the forward derivative is small. This aligns with our intuition that it will be more difficult to find adversarial samples close to a point where an input is near 0 than close to the spike. This tells the adversary to focus on features corresponding to larger forward derivative values in a given input when constructing a sample, making his search more efficient and ultimately leading to smaller overall distortions.
The takeaways of this example are therefore: (1) small input variations can lead to extreme variations of the output of the neural network, (2) not all regions of the input domain are conducive to finding adversarial samples, and (3) the forward derivative reduces the adversarial-sample search space.
We now generalize this approach to any feedforward DNN, using the same assumptions and adversary model from Section III-A. The only assumptions we make on the architecture are that its neurons form an acyclic DNN, and each use a differentiable activation function. Note that this last assumption is not limiting because the back-propagation algorithm imposes the same requirement. In Figure 6, we give an example of a feedforward deep neural network architecture and define some notations used throughout the remainder of the paper. Most importantly, the $N$-dimensional function $F$ learned by the DNN during training assigns an output $Y = F(X)$ when given an $M$-dimensional input $X$. We write $n$ the number of hidden layers. Layers are indexed by $k \in 0..n+1$ such that $k = 0$ is the index of the input layer, $k \in 1..n$ corresponds to hidden layers, and $k = n+1$ indexes the output layer.
Algorithm 1 shows our process for constructing adversarial samples. As input, the algorithm takes a benign sample $X$, a target output $Y^*$, an acyclic feedforward DNN $F$, a maximum distortion parameter $\Upsilon$, and a feature variation parameter $\theta$. It returns a new adversarial sample $X^*$ such that $F(X^*) = Y^*$, and proceeds in three basic steps: (1) compute the forward derivative $\nabla F(X^*)$, (2) construct a saliency map $S$ based on the derivative, and (3) modify an input feature $i_{max}$ by $\theta$. This process is repeated until the network outputs $Y^*$ or the maximum distortion $\Upsilon$ is reached. We now detail each step.
The first step is to compute the forward derivative $\nabla F(X)$ for the given sample $X$. As introduced previously, this is given by:
$$\nabla F(X) = \frac{\partial F(X)}{\partial X} = \left[ \frac{\partial F_j(X)}{\partial x_i} \right]_{i \in 1..M,\ j \in 1..N} \tag{3}$$
This is essentially the Jacobian of the function corresponding to what the neural network learned during training. The forward derivative computes gradients that are similar to those computed for backpropagation, but with two important distinctions: we take the derivative of the network directly, rather than on its cost function, and we differentiate with respect to the input features rather than the network parameters. As a consequence, instead of propagating gradients backwards, we choose in our approach to propagate them forward, as this allows us to find input components that lead to significant changes in network outputs.
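Our goal is to express $\nabla F(X^*)$ in terms of $X$ and constant values only. To simplify our expressions, we now consider one element $(i, j) \in [1..M] \times [1..N]$ of the forward derivative matrix defined in Equation 3: that is, the derivative of one output neuron $F_j$ according to one input dimension $x_i$. Of course our results are true for any matrix element. We start at the first hidden layer of the neural network. We can differentiate the output of this first hidden layer in terms of the input components. We then recursively differentiate each hidden layer $k \in 2..n$ in terms of the previous one:
$$\frac{\partial H_k(X)}{\partial x_i} = \left[ \frac{\partial f_{k,p}(W_{k,p} \cdot H_{k-1} + b_{k,p})}{\partial x_i} \right]_{p \in 1..m_k} \tag{4}$$
where $H_k$ is the output vector of hidden layer $k$, $m_k$ is the number of neurons in layer $k$, and $f_{k,p}$ is the activation function of output neuron $p$ in layer $k$. Each neuron $p$ on a hidden or output layer indexed $k \in 1..n+1$ is connected to the previous layer using weights defined in vector $W_{k,p}$. By defining the weight matrix accordingly, we can define fully or sparsely connected interlayers, thus modeling a variety of architectures. Similarly, we write $b_{k,p}$ the bias for neuron $p$ of layer $k$. By applying the chain rule, we can write a series of formulae for $k \in 1..n$, where $f'_{k,p}$ denotes the derivative of the activation function with respect to its argument:
$$\left. \frac{\partial H_k(X)}{\partial x_i} \right|_{p \in 1..m_k} = \left( W_{k,p} \cdot \frac{\partial H_{k-1}}{\partial x_i} \right) \times f'_{k,p}(W_{k,p} \cdot H_{k-1} + b_{k,p}) \tag{5}$$
We are thus able to express $\frac{\partial H_n}{\partial x_i}$. We know that output neuron $j$ computes the following expression:
$$F_j(X) = f_{n+1,j}(W_{n+1,j} \cdot H_n + b_{n+1,j})$$
Thus, we apply the chain rule again to obtain:
$$\frac{\partial F_j(X)}{\partial x_i} = \left( W_{n+1,j} \cdot \frac{\partial H_n}{\partial x_i} \right) \times f'_{n+1,j}(W_{n+1,j} \cdot H_n + b_{n+1,j}) \tag{6}$$
In this formula, according to our threat model, all terms are known but one: $\frac{\partial H_n}{\partial x_i}$. This is precisely the term we computed recursively. By plugging these results for successive layers back in Equation 6, we get an expression of component $(i, j)$ of the DNN’s forward derivative. Hence, the forward derivative $\nabla F$ of a network $F$ can be computed for any input $X$ by successively differentiating layers starting from the input layer until the output layer is reached. We later discuss in our methodology evaluation the computability of $\nabla F$ for state-of-the-art DNN architectures. Notably, the forward derivative can be computed using symbolic differentiation.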
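The layer-by-layer recursion can be sketched numerically as follows. This is an illustration under simplifying assumptions of ours: every layer is fully connected and uses the logistic sigmoid (so the activation derivative is $s(1 - s)$), the weights are random placeholders, and the returned array is stored as `J[j, i]` $= \partial F_j / \partial x_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jacobian_forward(x, weights, biases):
    """Propagate activations and derivatives forward through an all-sigmoid MLP.

    weights[k] has shape (m_k, m_{k-1}); biases[k] has shape (m_k,).
    Returns (F(x), J) with J[j, i] = dF_j/dx_i.
    """
    h = x
    J = np.eye(x.size)                 # dH_0/dx is the identity since H_0 = X
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)         # layer output H_k
        # chain rule: dH_k/dx = f'(z_k) * (W_k @ dH_{k-1}/dx)
        J = (h * (1.0 - h))[:, None] * (W @ J)
    return h, J

# Hypothetical 4-5-3 network with random parameters, for illustration only.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [rng.normal(size=5), rng.normal(size=3)]
x = rng.uniform(size=4)

output, J = jacobian_forward(x, weights, biases)
```

The derivative is carried forward alongside the activations, which is why no cost function or backward pass appears in the sketch.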
We extend saliency maps previously introduced as visualization tools to construct adversarial saliency maps. These maps indicate which input features an adversary should perturb in order to effect the desired changes in network output most efficiently, and are thus versatile tools that allow adversaries to generate broad classes of adversarial samples.
Adversarial saliency maps are defined to suit problem-specific adversarial goals. For instance, we later study a network used as a classifier: its output is a probability vector across classes, where the final predicted class value corresponds to the component with the highest probability:
$$label(X) = \arg\max_j F_j(X) \tag{7}$$
In our case, the saliency map is therefore based on the forward derivative, as this gives the adversary the information needed to cause the neural network to misclassify a given sample. More precisely, the adversary wants to misclassify a sample $X$ such that it is assigned a target class $t \neq label(X)$. To do so, the probability of target class $t$ given by $F$, $F_t(X)$, must be increased while the probabilities $F_j(X)$ of all other classes $j \neq t$ decrease, until $t = \arg\max_j F_j(X)$. The adversary can accomplish this by increasing input features using the following saliency map $S(X, t)$:
$$S(X, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_t(X)}{\partial x_i} < 0 \text{ or } \sum_{j \neq t} \frac{\partial F_j(X)}{\partial x_i} > 0 \\ \frac{\partial F_t(X)}{\partial x_i} \left| \sum_{j \neq t} \frac{\partial F_j(X)}{\partial x_i} \right| & \text{otherwise} \end{cases} \tag{8}$$
where $i$ is an input feature. The condition specified on the first line rejects input components with a negative target derivative or an overall positive derivative on other classes. Indeed, $\frac{\partial F_t(X)}{\partial x_i}$ should be positive in order for $F_t(X)$ to increase when feature $x_i$ increases. Similarly, $\sum_{j \neq t} \frac{\partial F_j(X)}{\partial x_i}$ needs to be negative to decrease or stay constant when feature $x_i$ is increased. The product on the second line allows us to consider all other forward derivative components together in such a way that we can easily compare $S(X, t)[i]$ for all input features. In summary, high values of $S(X, t)[i]$ correspond to input features that will either increase the target class, or decrease other classes significantly, or both. By increasing these features, the adversary eventually misclassifies the sample into the target class. A saliency map example is shown on Figure 7.
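The saliency map for increasing features can be written in a few vectorized lines. The sketch below assumes the forward derivative is available as an array `jacobian[j, i]` $= \partial F_j / \partial x_i$; the function name and the small example matrix are ours and purely illustrative.

```python
import numpy as np

def saliency_increase(jacobian, target):
    """Saliency value of each input feature when features are increased.

    jacobian[j, i] = dF_j/dx_i; target is the index t of the target class.
    """
    d_target = jacobian[target]                      # dF_t/dx_i
    d_others = jacobian.sum(axis=0) - d_target       # sum over j != t of dF_j/dx_i
    rejected = (d_target < 0) | (d_others > 0)       # first case of the map
    return np.where(rejected, 0.0, d_target * np.abs(d_others))

# Illustrative forward derivative for 3 classes and 3 input features.
J_example = np.array([[0.2, -0.1, 0.3],
                      [-0.5, 0.4, 0.1],
                      [0.1, 0.2, -0.3]])
S_inc = saliency_increase(J_example, target=0)
best_feature = int(np.argmax(S_inc))
```

In this example, feature 1 is rejected because increasing it lowers the target class, while features 0 and 2 both raise the target and lower the other classes; feature 0 obtains the larger score and would be perturbed first.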
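It is possible to define other adversarial saliency maps using the forward derivative, and the quality of the map can have a large impact on the amount of distortion that Algorithm 1 introduces; we will study this in more detail later. Before moving on, we introduce an additional map that acts as a counterpart to the one given in Equation 8 by finding features that the adversary should decrease to achieve misclassification. The only difference lies in the constraints placed on the forward derivative values and the location of the absolute value in the second line:
$$S(X, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_t(X)}{\partial x_i} > 0 \text{ or } \sum_{j \neq t} \frac{\partial F_j(X)}{\partial x_i} < 0 \\ \left| \frac{\partial F_t(X)}{\partial x_i} \right| \sum_{j \neq t} \frac{\partial F_j(X)}{\partial x_i} & \text{otherwise} \end{cases} \tag{9}$$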
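The counterpart map for decreasing features differs only in the sign tests and in where the absolute value is taken. As before, this is a sketch under the assumption that the forward derivative is stored as `jacobian[j, i]` $= \partial F_j / \partial x_i$; names and numbers are illustrative.

```python
import numpy as np

def saliency_decrease(jacobian, target):
    """Saliency value of each input feature when features are decreased."""
    d_target = jacobian[target]                      # dF_t/dx_i
    d_others = jacobian.sum(axis=0) - d_target       # sum over j != t of dF_j/dx_i
    # Decreasing x_i helps only if dF_t/dx_i <= 0 and the other classes'
    # summed derivative is >= 0, so everything else is rejected.
    rejected = (d_target > 0) | (d_others < 0)
    return np.where(rejected, 0.0, np.abs(d_target) * d_others)

# Same illustrative forward derivative as a 3-class, 3-feature example.
J_example = np.array([[0.2, -0.1, 0.3],
                      [-0.5, 0.4, 0.1],
                      [0.1, 0.2, -0.3]])
S_dec = saliency_decrease(J_example, target=0)
```

Here only feature 1 receives a non-zero score: lowering it raises the target class and lowers the others, which is the mirror image of the increasing case.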
Once an input feature has been identified by an adversarial saliency map, it needs to be perturbed to realize the adversary’s goal. This is the last step in each iteration of Algorithm 1, and the amount by which the selected feature is perturbed ($\theta$ in Algorithm 1) is also problem-specific. We discuss in Section IV how this parameter should be set in an application to computer vision. Lastly, the maximum number of iterations, which is equivalent to the maximum distortion allowed in a sample, is specified by parameter $\Upsilon$. It limits the number of features changed to craft an adversarial sample and can take any positive integer value smaller than the number of features. Finding the right value for $\Upsilon$ requires considering the impact of distortion on humans’ perception of adversarial samples: too much distortion might cause adversarial samples to be easily identified by humans.
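Putting the three steps together, the crafting loop can be sketched as below. This is not the authors' implementation: it is a simplified illustration in which the target is a class index rather than an output vector, the distortion budget is a count of modified features, each feature is modified at most once and clipped to $[0, 1]$, and the model is a hypothetical linear softmax classifier whose Jacobian has a closed form. All names (`craft`, `predict`, `jacobian`, `max_features`) are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def craft(x, target, predict, jacobian, theta=1.0, max_features=None):
    """Iteratively increase the most salient feature until `target` is predicted."""
    x_adv = x.astype(float).copy()
    candidates = set(range(x.size))            # features not yet modified
    budget = x.size if max_features is None else max_features
    changed = 0
    while np.argmax(predict(x_adv)) != target and changed < budget and candidates:
        J = jacobian(x_adv)                    # step 1: forward derivative
        d_t = J[target]
        d_o = J.sum(axis=0) - d_t
        S = np.where((d_t < 0) | (d_o > 0), 0.0, d_t * np.abs(d_o))  # step 2
        idx = sorted(candidates)
        i = idx[int(np.argmax(S[idx]))]
        if S[i] <= 0:
            break                              # no admissible feature remains
        x_adv[i] = np.clip(x_adv[i] + theta, 0.0, 1.0)               # step 3
        candidates.discard(i)
        changed += 1
    return x_adv

# Hypothetical linear softmax classifier: 3 classes, 4 input features.
W = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 2.0]])

def predict(x):
    return softmax(W @ x)

def jacobian(x):
    p = predict(x)
    return (np.diag(p) - np.outer(p, p)) @ W   # d softmax(Wx) / dx

x = np.array([1.0, 0.0, 0.0, 0.0])             # classified as class 0
x_adv = craft(x, target=2, predict=predict, jacobian=jacobian)
```

On this toy model the loop modifies two of the four features before the classifier outputs the target class, leaving the feature that supports the source class untouched.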
We formally described a class of algorithms for crafting adversarial samples misclassified by feedforward DNNs using three tools: the forward derivative, adversarial saliency maps, and the crafting algorithm. We now apply these tools to a DNN used for a computer vision classification task: handwritten digit recognition. We show that our algorithms successfully craft adversarial samples from any source class to any given target class, which for this application means that any digit can be perturbed so that it is misclassified as any other digit.
More recent architectures are heavily reliant on the convolutional layers introduced in the LeNet architecture, thus making LeNet a relevant DNN to validate our approach. We have no reason to believe that our method will not perform well on larger architectures. The network input is black and white images (28x28 pixels) of handwritten digits, which are flattened as vectors of 784 features, where each feature corresponds to a pixel intensity taking normalized values between 0 and 1. This input is processed by a succession of a convolutional layer (20 then 50 kernels of 5x5 pixels) and a pooling layer (2x2 filters) repeated twice, a fully connected hidden layer (500 neurons), and an output softmax layer (10 neurons). The output is a 10 class probability vector, where each class corresponds to a digit from 0 to 9, as shown in Figure 8. The network then labels the input image with the class assigned the maximum probability, as shown in Equation 7. We train our network using the MNIST training dataset of 60,000 samples.
We attempt to determine whether, using the theoretical framework introduced in previous sections, we can effectively craft adversarial samples misclassified by the DNN. For instance, if we have an image $X$ of a handwritten digit 0 classified by the network as $label(X) = 0$ and the adversary wishes to craft an adversarial sample $X^*$ based on this image classified as $label(X^*) = 7$, the source class is 0 and the target class is 7. Ideally, the crafting process must find the smallest perturbation $\delta_X$ required to construct the adversarial sample $X^* = X + \delta_X$. A perturbation is a set of pixel intensities (or input feature variations) that are added to $X$ in order to craft $X^*$. Note that perturbations introduced to craft adversarial samples must remain indistinguishable to humans.
Algorithm 2 shows the crafting algorithm used in our experiments, which we implemented in Python (see Appendix A for more information regarding the implementation). It is based on Algorithm 1, but several details have been changed to accommodate our handwritten digit recognition problem. Given a network $F$, Algorithm 2 iteratively modifies a sample $X$ by perturbing two input features (i.e., pixel intensities) $p_1$ and $p_2$ selected by saliency_map. The saliency map is constructed and updated between each iteration of the algorithm using the DNN’s forward derivative $\nabla F(X^*)$. The algorithm halts when one of the following conditions is met: (1) the adversarial sample is classified by the DNN with the target class $t$, (2) the maximum number of iterations max_iter has been reached, or (3) the feature search domain $\Gamma$ is empty. The crafting algorithm is fine-tuned by three parameters:
Maximum distortion $\Upsilon$: this defines when the algorithm should stop modifying the sample in order to reach the adversarial target class. The maximum distortion, expressed as a percentage, corresponds to the maximum number of pixels to be modified when crafting the adversarial sample, and thus sets the maximum number of iterations max_iter (2 pixels modified per iteration) as follows:
$$\text{max\_iter} = \left\lfloor \frac{784 \cdot \Upsilon}{2 \cdot 100} \right\rfloor$$
where 784 is the number of pixels in a sample.
Saliency map: subroutine saliency_map generates a map defining which input features will be modified at each iteration. Policies used to generate saliency maps vary with the nature of the data handled by the considered DNN, as well as the adversarial goals. We provide a subroutine example later in Algorithm 3.
Feature variation per iteration $\theta$: once input features have been selected using the saliency map, they must be modified. The variation $\theta$ introduced to these features is another parameter that the adversary must set, in accordance with the saliency maps she uses.
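The interplay of the three parameters and the three halting conditions can be sketched in Python as follows. This is a minimal illustration rather than the authors' Algorithm 2: `predict` and `saliency_map` are placeholder callables supplied by the caller, the iteration budget assumes the distortion is a percentage of pixels with two pixels modified per iteration, and the toy classifier at the bottom is hypothetical.

```python
import math


def max_iterations(num_pixels, max_distortion_pct):
    """Iteration budget when two pixels are modified per iteration."""
    return math.floor(num_pixels * max_distortion_pct / (2 * 100))


def craft(sample, target, predict, saliency_map, theta, max_distortion_pct):
    """Perturb pixel pairs until `predict` returns `target` or a stop condition is hit.

    `predict(x)` returns a class label; `saliency_map(x, target, domain)`
    returns a pair of indices taken from `domain`, or None.
    """
    adversarial = list(sample)
    domain = set(range(len(adversarial)))  # feature search domain
    budget = max_iterations(len(adversarial), max_distortion_pct)
    iterations = 0
    while (predict(adversarial) != target      # (1) target class reached
           and iterations < budget             # (2) iteration budget exhausted
           and len(domain) >= 2):              # (3) no pixel pair left to try
        pair = saliency_map(adversarial, target, domain)
        if pair is None:
            break
        for index in pair:
            # Apply the variation and keep intensities inside [0, 1].
            adversarial[index] = min(1.0, max(0.0, adversarial[index] + theta))
            domain.discard(index)  # modified pixels leave the search domain
        iterations += 1
    return adversarial, predict(adversarial) == target


# Hypothetical stand-ins, used only to exercise the loop.
def toy_predict(x):
    return 1 if sum(x) > 2 else 0


def toy_saliency_map(x, target, domain):
    return tuple(sorted(domain)[:2])


adversarial, success = craft([0.0] * 8, target=1, predict=toy_predict,
                             saliency_map=toy_saliency_map, theta=1.0,
                             max_distortion_pct=100.0)
print(adversarial, success)
```

Passing the saliency map as a callable mirrors the text's point that the selection policy depends on the data and on the adversarial goal.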
The problem of finding good values for these parameters is a goal of our current evaluation, and is discussed later in Section V. For now, note that human perception is a limiting factor as it limits the acceptable maximum distortion and feature variation introduced. We now show the application of our framework with two different adversarial strategies.
The first strategy to craft adversarial samples is based on increasing the intensity of some pixels. To this end, we consider 10 samples of handwritten digits from the MNIST test set, one from each digit class 0 to 9. We use this small subset of samples to illustrate our techniques, and scale up the evaluation to the entire dataset in Section V. Our goal is to report whether we can reach any adversarial target class for a given source class. For instance, if we are given a handwritten 0, we increase some of the pixel intensities to produce 9 adversarial samples respectively classified in each of the classes 1 to 9. All pixel intensities changed are increased by $\theta = +1$. We discuss this choice of parameter in Section V. We allow for an unlimited maximum distortion $\Upsilon = \infty$. We simply measure for each of the 90 source-target class pairs whether an adversarial sample can be produced or not.
The adversarial saliency map used in the crafting algorithm to select pixel pairs that can be increased is an application of the map introduced in the general case of classification in Equation 8. The map aims to find pairs of pixels $(p_1, p_2)$ using the following heuristic:
$$\arg\max_{(p_1, p_2)} \left( \sum_{i = p_1, p_2} \frac{\partial \mathbf{F}_t(\mathbf{X})}{\partial \mathbf{X}_i} \right) \times \left| \sum_{i = p_1, p_2} \sum_{j \neq t} \frac{\partial \mathbf{F}_j(\mathbf{X})}{\partial \mathbf{X}_i} \right| \qquad (10)$$
where $t$ is the index of the target class, the left operand of the multiplication operation is constrained to be positive, and the right operand of the multiplication operation is constrained to be negative. This heuristic, introduced in the previous section of this manuscript, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the output of all other classes when simultaneously increased. The pseudocode of the corresponding subroutine saliency_map is given in Algorithm 3.
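The pair-selection heuristic described above can be sketched in Python as an exhaustive search over pixel pairs. This is an illustration, not a reproduction of Algorithm 3: it assumes the forward derivative is available as a nested list `jacobian[j][i]` (output class `j`, input feature `i`), and the function name and example values are hypothetical.

```python
from itertools import combinations


def saliency_pair_increase(jacobian, target, domain):
    """Pick the pixel pair maximizing alpha * |beta| with alpha > 0 and beta < 0.

    `jacobian[j][i]` is the derivative of output j with respect to input i.
    Returns None when no pair satisfies both sign constraints.
    """
    num_classes = len(jacobian)
    best_pair, best_score = None, 0.0
    for p, q in combinations(sorted(domain), 2):
        # alpha: combined effect of the pair on the target class output.
        alpha = jacobian[target][p] + jacobian[target][q]
        # beta: combined effect of the pair on all other class outputs.
        beta = sum(jacobian[j][p] + jacobian[j][q]
                   for j in range(num_classes) if j != target)
        if alpha > 0 and beta < 0 and alpha * abs(beta) > best_score:
            best_pair, best_score = (p, q), alpha * abs(beta)
    return best_pair


# Hypothetical 2-class, 3-feature forward derivative.
example_jacobian = [[-1.0, -2.0, 0.5],
                    [2.0, 1.0, -0.5]]
print(saliency_pair_increase(example_jacobian, target=1, domain={0, 1, 2}))
```

The search is quadratic in the size of the domain, so a practical implementation would likely vectorize it.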
The saliency map considers pairs of pixels and not individual pixels because selecting pixels one at a time is too strict, and very few pixels would meet the heuristic search criteria described in Equation 8. Searching for pairs of pixels is more likely to match the condition because one of the pixels can compensate for a minor flaw of the other pixel. Let’s consider a simple example: $p_1$ has a target derivative of but a sum of other classes derivatives equal to , while $p_2$ has a target derivative equal to and a sum of other classes derivatives equal to . Individually, these pixels do not match the saliency map’s criteria stated in Equation 8, but combined, the pair does match the saliency criteria defined in Equation 10. One could also envision considering larger groups of input features to define saliency maps. However, this comes at a greater computational cost because more combinations need to be considered each time the group size is increased.
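To make the compensation argument concrete, the short Python sketch below uses two pixels with made-up derivative values (chosen for illustration, not taken from the experiments). Each pixel fails the sign constraints on its own, while their sum passes. The last lines count how many groups must be examined as the group size grows.

```python
import math


def meets_criteria(target_derivative, others_derivative):
    """Sign constraints for the increasing strategy: target up, other classes down."""
    return target_derivative > 0 and others_derivative < 0


# Hypothetical (target derivative, sum of other-class derivatives) per pixel.
pixel_a = (4.0, 0.2)    # helps the target, but slightly helps other classes too
pixel_b = (-0.3, -5.0)  # hurts other classes, but slightly hurts the target
pair = (pixel_a[0] + pixel_b[0], pixel_a[1] + pixel_b[1])

print(meets_criteria(*pixel_a), meets_criteria(*pixel_b), meets_criteria(*pair))


def group_count(num_features, group_size):
    """Number of distinct feature groups of a given size."""
    return math.comb(num_features, group_size)


print(group_count(784, 2), group_count(784, 3))
```

For 784 features, moving from pairs to triples multiplies the number of candidate groups by several hundred, which illustrates the cost mentioned in the text.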
In our implementation of these algorithms, we compute the forward derivative of the network using the last hidden layer instead of the output probability layer. This is justified by the extreme variations introduced by the logistic regression computed between these two layers to ensure probabilities sum up to 1, leading to extreme derivative values. This reduces the quality of information on how the neurons are activated by different inputs and causes the forward derivative to lose accuracy when generating saliency maps. Better results are achieved when working with the last hidden layer, also made up of 10 neurons, each corresponding to one digit class 0 to 9. This justifies enforcing constraints on the forward derivative. Indeed, as the output layer used for computing the forward derivative does not sum up to 1, increasing $\mathbf{F}_t(\mathbf{X})$ does not imply that $\sum_{j \neq t} \mathbf{F}_j(\mathbf{X})$ will decrease, and vice-versa.
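One way to see the difference between the two layers is to look at the Jacobian of the softmax function itself. The Python sketch below assumes a standard softmax; the function names and logit values are illustrative. It shows that every column of the softmax Jacobian sums to zero, so any gain for one class is exactly offset by the others, and that the entries become very small once one class dominates. The layer feeding the softmax has no such coupling, which is consistent with enforcing the sign constraints explicitly.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(logits):
    """J[j][k] = d softmax_j / d logit_k = p_j * (delta_jk - p_k)."""
    p = softmax(logits)
    n = len(p)
    return [[p[j] * ((1.0 if j == k else 0.0) - p[k]) for k in range(n)]
            for j in range(n)]


def column_sums(matrix):
    return [sum(column) for column in zip(*matrix)]


balanced = softmax_jacobian([1.0, 2.0, 3.0])
saturated = softmax_jacobian([10.0, 0.0, -5.0])  # one class dominates

print(column_sums(balanced))                            # numerically zero
print(max(abs(v) for row in saturated for v in row))    # very small entries
```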
The algorithm is able to craft successful adversarial samples for all 90 source-target class pairs. Figure 1 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. The original samples are found on the diagonal. A sample on row $i$ and column $j$, when $i \neq j$, is a sample crafted from an image originally classified as source class $i$ to be misclassified as target class $j$.
To verify the validity of our algorithms, and more specifically of our adversarial saliency maps, we run a simple experiment. We run the crafting algorithm on an empty input (all pixels initially set to an intensity of 0) and craft one adversarial sample for each class from 0 to 9. The different samples shown in Figure 9 demonstrate how adversarial saliency maps are able to identify input features relevant to classification in a class.
Instead of increasing pixel intensities to achieve the adversarial targets, the second adversarial strategy decreases pixel intensities by . The implementation is identical except for the adversarial saliency map. The formula is the same as previously written in Equation 10, but the constraints are different: the left operand of the multiplication operation is now constrained to be negative, and the right operand to be positive. This heuristic, also introduced in the previous section of this paper, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the output of all other classes when simultaneously decreased.
The algorithm is once again able to craft successful adversarial samples for all source-target class pairs. Figure 10 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. One observation to be made is that the distortion introduced by reducing pixel intensities seems harder to detect by the human eye. We address the human perception aspect with a study later in Section V.
We now use our experimental setup to answer the following questions: (1) “Can we exploit any sample?”, (2) “How can we identify samples more vulnerable than others?” and (3) “How do humans perceive adversarial samples compared to DNNs?”. Our primary result is that adversarial samples can be crafted reliably for our validation problem with a 97% success rate by modifying samples on average by 4.02%. We define a hardness measure to identify sample classes that are easier to exploit than others. This measure is necessary for designing robust defenses. We also found that humans cannot perceive the perturbation introduced to craft adversarial samples misclassified by the DNN: they still correctly classify adversarial samples crafted with a distortion smaller than .
Having shown the feasibility of crafting adversarial samples for all source-target class pairs, we now seek to measure whether the crafting algorithm can successfully handle large quantities of distinct samples of handwritten digits. That is, we design a set of experiments to evaluate whether or not all legitimate samples in the MNIST dataset can be exploited by an adversary to produce adversarial samples. We run our crafting algorithm on three sets of 10,000 samples, each extracted from one of the three MNIST training, validation, and test subsets. (Note that we extracted original samples from the dataset for convenience. Any sample can be used as an input to the adversarial crafting algorithm.) For each of these samples, we craft 9 adversarial samples, each of them classified in one of the 9 target classes distinct from the original legitimate class. Thus, we generate 90,000 samples for each set, leading to a total of 270,000 adversarial samples. We set the maximum distortion to and pixel intensities are increased by $\theta = +1$. The maximum distortion was fixed after studying the effect of increasing it on the success rate. We found that of the adversarial samples could be crafted with a distortion of less than and observed that the success rate did not increase significantly for larger maximum distortions. Parameter $\theta$ was set to $+1$ after observing that decreasing it or giving it negative values increased the number of features modified, whereas we were interested in reducing the number of features altered during crafting. One will also notice that because features are normalized between 0 and 1, if we introduce a variation of $\theta = +1$, we always set pixels to their maximum value 1. This justifies why, in Algorithm 2, we remove modified pixels from the search space at the end of each iteration. The impact on performance is beneficial, as we reduce the size of the feature search space at each iteration. In other words, our algorithm performs a best-first heuristic search without backtracking.
We measure the success rate and distortion of adversarial samples on the three sets of 10,000 samples. The success rate is defined as the percentage of adversarial samples that were successfully classified by the DNN as the adversarial target class. The distortion is defined to be the percentage of pixels modified in the legitimate sample to obtain the adversarial sample. In other words, it is the percentage of input features modified in order to obtain adversarial samples. We compute two average distortion values: one taking into account all samples and a second one only taking into account successful samples, which we write . Figure 11 presents the results for the three sets from which the original samples were extracted. The results are consistent across all sets. On average, the success rate is , the average distortion of all adversarial samples is , and the average distortion of successful adversarial samples is . This means that the average number of pixels modified to craft a successful adversarial sample is out of pixels. The first distortion figure is higher because it includes unsuccessful samples, for which the crafting algorithm used the maximum distortion , but was unable to induce a misclassification.
|Source set of original samples|Adversarial samples successfully misclassified|Average distortion (all adversarial samples)|Average distortion (successful adversarial samples)|
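The three quantities reported for each set (success rate, average distortion over all samples, and average distortion over successful samples) can be computed as in the Python sketch below. It assumes each crafting attempt is recorded as an `(original, adversarial, success)` triple of equal-length feature lists; this record layout and the toy values are assumptions made for illustration.

```python
def distortion_pct(original, adversarial):
    """Percentage of features whose value differs between the two samples."""
    changed = sum(1 for a, b in zip(original, adversarial) if a != b)
    return 100.0 * changed / len(original)


def summarize(results):
    """`results` is a list of (original, adversarial, success) triples."""
    distortions = [distortion_pct(o, a) for o, a, _ in results]
    successful = [d for d, (_, _, ok) in zip(distortions, results) if ok]
    return {
        "success_rate": 100.0 * len(successful) / len(results),
        "avg_distortion_all": sum(distortions) / len(distortions),
        "avg_distortion_successful":
            sum(successful) / len(successful) if successful else float("nan"),
    }


# Two toy attempts on a 4-pixel sample: one success, one failure at full distortion.
toy_results = [
    ([0, 0, 0, 0], [1, 1, 0, 0], True),
    ([0, 0, 0, 0], [1, 1, 1, 1], False),
]
print(summarize(toy_results))
```

As in the text, the average over all samples is pulled upward by failed attempts, which consume the whole distortion budget.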
We also studied crafting of adversarial samples using the decreasing saliency map. We found that the success rate was lower and the average distortion slightly lower. Again, decreasing pixel intensities is less successful at producing the desired adversarial behavior than increasing pixel intensities. Intuitively, this can be understood because removing pixels reduces the information entropy, thus making it harder for DNNs to extract the information required to classify the sample. Greater absolute values of intensity variations are more confidently misclassified by the DNN.
Looking at the previous experiment, about 3% of the adversarial samples were not successfully crafted. This suggests that some samples are harder to exploit than others. Furthermore, the distortion figures reported are averaged over all adversarial samples produced, but not all samples require the same distortion to be misclassified. Thus, we now study the hardness of different samples in order to quantify these phenomena. Our aim is to identify which source-target class pairs are easiest to exploit, as well as similarities between distinct source-target class pairs. A class pair $(s, t)$ is a pair of a source class $s$ and a target class $t$. This hardness metric allows us to lay the groundwork for defense mechanisms.
In this experiment, we construct a deeper understanding of the crafting algorithm’s success rate and average distortion for different source-target class pairs. We use the 90,000 adversarial samples crafted in the previous experiments from the 10,000 samples of the MNIST test set.
We break down the success rate reported in Figure 11 by source-target class pairs. This allows us to know, for a given source class, how many samples of that class were successfully misclassified in each of the target classes. In Figure 12, we draw the success rate matrix indicating which pairs are most successful. Darker shades correspond to higher success rates. The rows correspond to the success rate per source class while the columns correspond to the success rate per target class. If one reads the matrix row-wise, it can be perceived that classes 0, 2, and 8 are hard to start with, while classes 1, 7, and 9 are easy to start with. Similarly, reading the matrix column-wise, one can observe that classes 1 and 7 are very hard to make, while classes 0, 8, and 9 are easy to make.
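A breakdown of this kind can be computed from per-sample outcomes as in the Python sketch below. It assumes each attempt is stored as a `(source, target, success)` record; the record layout, function names, and the three-class toy data are hypothetical. Row means indicate how easy a class is to start from, and column means how easy it is to reach.

```python
def success_rate_matrix(outcomes, num_classes=10):
    """Success rate per (source, target) pair; None where no attempt was recorded."""
    attempts = [[0] * num_classes for _ in range(num_classes)]
    successes = [[0] * num_classes for _ in range(num_classes)]
    for source, target, success in outcomes:
        attempts[source][target] += 1
        successes[source][target] += int(success)
    return [[successes[s][t] / attempts[s][t] if attempts[s][t] else None
             for t in range(num_classes)]
            for s in range(num_classes)]


def mean_defined(values):
    """Mean of the entries that are not None (the diagonal is skipped this way)."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None


def per_source(matrix):
    return [mean_defined(row) for row in matrix]


def per_target(matrix):
    return [mean_defined(column) for column in zip(*matrix)]


toy_outcomes = [(0, 1, True), (0, 1, False), (0, 2, True),
                (1, 0, True), (1, 2, True),
                (2, 0, False), (2, 1, False)]
toy_matrix = success_rate_matrix(toy_outcomes, num_classes=3)
print(per_source(toy_matrix), per_target(toy_matrix))
```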
In Figure 13, we report the average distortion of successful samples by source-target class pair, thus identifying class pairs requiring the most distortion to successfully craft adversarial samples. Interestingly, classes requiring lower distortions correspond to classes with higher success rates in the previous matrix. For instance, the column corresponding to class 1 is associated with the highest distortions, and it was the column with the lowest success rates in the previous matrix. Indeed, the higher the average distortion of a class pair is, the more likely samples in that class pair are to reach the maximum distortion, and thus produce unsuccessful adversarial samples.
To better understand why some class pairs were harder to exploit, we tracked the evolution of class probabilities during the crafting process. We observed that the distortion required to leave the source class was higher for class pairs with high distortions, whereas the distortion required to reach the target class, once the source class had been left, remained similar. This correlates with the fact that some source classes are more confidently classified by the DNN than others.
Results indicating that some source-target class pairs are not as easy as others lead us to question the existence of a measure quantifying the distance between two classes. This is relevant to a defender seeking to identify which classes of a DNN are most vulnerable to adversaries. We name this measure the hardness of a target class relative to a given source class. It normalizes the average distortion of a class pair $(s, t)$ relative to its success rate:
$$H(s, t) = \int_{\tau} \varepsilon(s, t, \tau)\, d\tau$$
where $\varepsilon(s, t, \tau)$ is the average distortion of a set of samples for the corresponding success rate $\tau$. In practice, these two quantities are computed over a finite number of samples by fixing a set of $K$ maximum distortion parameter values $\Upsilon_k$ in the crafting algorithm, where $k \in 1..K$. The set of maximum distortions gives a series of pairs $(\varepsilon_k, \tau_k)$ for $k \in 1..K$. Thus, the practical formula used to compute the hardness of a source-destination class pair can be derived from the trapezoidal rule:
$$H(s, t) \approx \sum_{k=1}^{K-1} (\tau_{k+1} - \tau_k)\, \frac{\varepsilon(s, t, \tau_{k+1}) + \varepsilon(s, t, \tau_k)}{2}$$
We computed the hardness values for all classes using a set of maximum distortion values in the algorithm. Average distortions and success rates are averaged over 9,000 adversarial samples for each maximum distortion value. Figure 14 shows the hardness values for all source-target class pairs. The reader will observe that the matrix has a shape similar to the average distortion matrix plotted in Figure 13. However, the hardness measure is more accurate because it is computed using a series of maximum distortions.
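The trapezoidal estimate of hardness can be sketched in Python as follows. The sketch assumes each maximum distortion setting yields one (average distortion, success rate) pair and that the pairs are integrated in order of increasing success rate; the sorting step and the toy values are assumptions made for illustration, not results from the experiments.

```python
def hardness(pairs):
    """Trapezoidal estimate of the area under average distortion versus success rate.

    `pairs` holds one (average_distortion, success_rate) tuple per
    maximum distortion setting.
    """
    ordered = sorted(pairs, key=lambda pair: pair[1])  # increasing success rate
    total = 0.0
    for (eps_k, tau_k), (eps_next, tau_next) in zip(ordered, ordered[1:]):
        # Width of the success-rate interval times the mean distortion over it.
        total += (tau_next - tau_k) * (eps_next + eps_k) / 2.0
    return total


# Hypothetical measurements for one source-target class pair.
toy_pairs = [(1.0, 0.0), (3.0, 0.5), (5.0, 1.0)]
print(hardness(toy_pairs))
```

A class pair that needs more distortion to reach each success rate encloses a larger area and therefore receives a larger hardness value.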
The measure introduced lays the groundwork for finding defenses against adversarial samples. Indeed, if the hardness measure were predictive instead of being computed after adversarial crafting, the defender could identify vulnerable inputs. Furthermore, a predictive measure applicable to a single sample would allow a defender to evaluate the vulnerability of specific samples as well as class pairs. We investigated several complex estimators, including convolutional transformations of the forward derivative or Hessian matrices. However, we found that simply using a formula derived from the intuition behind adversarial saliency maps gave enough accuracy for predicting the hardness of samples in our experimental setup.
We name this predictive measure the adversarial distance of sample $\mathbf{X}$ to class $t$ and write it $A(\mathbf{X}, t)$. Simply put, it estimates the distance between a sample $\mathbf{X}$ and a target class $t$. We define the distance as:
$$A(\mathbf{X}, t) = 1 - \frac{1}{M} \sum_{i=1}^{M} 1_{S(\mathbf{X}, t)[i] > 0} \qquad (13)$$
where $M$ is the number of input features, $S(\mathbf{X}, t)$ is the adversarial saliency map, and $1_E$ is the indicator function for event $E$ (i.e., $1_E$ is 1 if and only if $E$ is true). In a nutshell, $A(\mathbf{X}, t)$ is one minus the normalized number of non-zero elements in the adversarial saliency map of $\mathbf{X}$ computed during the first crafting iteration in Algorithm 2. The closer the adversarial distance is to 1, the harder sample $\mathbf{X}$ is likely to be to misclassify in target class $t$. Figure 15 confirms that this formula is empirically well-founded. It illustrates the value of the adversarial distance averaged per source-destination class pair, making it easy to compare the average value with the hardness matrix computed previously after crafting samples. To compute it, we slightly altered Equation 13 to sum over pairs of features, reflecting the observations made during our validation process.
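A minimal Python sketch of the adversarial distance follows. It assumes the distance is one minus the fraction of strictly positive entries in the saliency map, which agrees with the statement that values close to 1 indicate samples that are harder to misclassify; the function name and the toy saliency values are hypothetical.

```python
def adversarial_distance(saliency_values):
    """One minus the fraction of strictly positive saliency map entries."""
    positive = sum(1 for value in saliency_values if value > 0)
    return 1.0 - positive / len(saliency_values)


# Toy saliency maps over four features.
print(adversarial_distance([0.0, 0.5, 0.0, 2.0]))  # half the features are usable
print(adversarial_distance([0.0, 0.0, 0.0, 0.0]))  # no usable feature
```

A map with no positive entry gives a distance of 1: the first crafting iteration finds nothing to perturb toward the target class.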
This notion of distance between classes intuitively defines a metric for the robustness of a network $\mathbf{F}$ against adversarial perturbations. We suggest the following definition:
$$R(\mathbf{F}) = \min_{(\mathbf{X}, t)} A(\mathbf{X}, t)$$
where $A(\mathbf{X}, t)$ denotes the adversarial distance of sample $\mathbf{X}$ to class $t$, and the set of samples $\mathbf{X}$ considered is sufficiently large to represent the input domain of the network. A good approximation of the robustness can be computed with the training dataset. Note that the min operator used here can be replaced by other relevant operators, like the statistical expectation. The study of various operators is left as future work.
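The aggregation step can be sketched in Python with the operator left as a parameter. The sketch assumes the adversarial distances have already been computed for every (sample, target) pair and that the default operator is the minimum, that is, a worst-case view; the function name and the values are hypothetical.

```python
from statistics import mean


def robustness(distances, reducer=min):
    """Aggregate per-(sample, target) adversarial distances into one score."""
    return reducer(distances)


toy_distances = [0.9, 0.4, 0.7]
print(robustness(toy_distances))                 # worst case over all pairs
print(robustness(toy_distances, reducer=mean))   # expectation-style alternative
```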
Recall that adversarial samples must not only be misclassified as the target class by deep neural networks, but also visually appear (be classified) as the source class by humans. To evaluate this property, we ran an experiment using 349 human participants on the Mechanical Turk online service. We presented three original or adversarially altered samples from the MNIST dataset to human participants. To paraphrase, participants were asked for each sample: (a) ‘is this sample a numeric digit?’, and (b) ‘if yes to (a) what digit is it?’. These two questions were designed to determine how distortion and intensity rates affected human perception of the samples.
The first experiment was designed to identify a baseline perception rate for the input data. The participants were presented 3 of 222 unaltered samples randomly picked from the original MNIST data set. Respondents identified as digits and classified the digits correctly of the samples.
Shown in Figure 16, a second set of experiments attempted to evaluate how the amount of distortion () impacts human perception. Here, participants were presented with a total of samples with varying levels of distortion (and features altered with an intensity increase ). The experiments showed that below a threshold ( distortion), participants were able to identify samples as digits () and correctly classify them () only slightly less accurately than the unaltered samples. The classification rate dropped dramatically () at distortion rates above the threshold.
A final set of experiments evaluates the impact of intensity variations () on perception, as shown in Figure 17. The participants were accurate at identifying samples as digits () and classifying them correctly (). At higher absolute intensities ( and ), specific digit classification decreased slightly ( and ), but identification as digits was largely unchanged.
While preliminary, these experiments confirm that the overwhelming majority of generated samples retain human recognizability. Note that because we can generate samples with less than the distortion threshold for almost all of the input data ( for roughly 97% in the MNIST data), we can produce adversarial samples that humans will mis-interpret, thus meeting our adversarial goal. Furthermore, altering feature distortion intensity provides even better results: at , humans classified the sample data at essentially the same rates as the original sample data.
We introduced a new class of algorithms that systematically craft adversarial samples misclassified by a DNN once an adversary possesses knowledge of the DNN architecture. Although we focused our work on DL techniques used in the context of classification and trained with supervised methods, our approach is also applicable to unsupervised architectures. Instead of achieving a given target class $t$, the adversary achieves a target output $\mathbf{Y}^*$. Because the output space is more complex, it might be harder or impossible to match $\mathbf{Y}^*$. In that case, Equation 1 would need to be relaxed with an acceptable distance between the network output $\mathbf{F}(\mathbf{X}^*)$ and the adversarial target $\mathbf{Y}^*$. Thus, the only remaining assumption made in this paper is that DNNs are feedforward. In other words, we did not consider recurrent neural networks, with cycles in their architecture, as the forward derivative must be adapted to accommodate such networks.
One of our key results is reducing the distortion (the number of features altered) needed to craft adversarial samples, compared to previous work. We believe this makes adversarial crafting much easier for input domains like malware executables, which are not as easy to perturb as images [11, 16]. This distortion reduction comes with a performance cost. Indeed, more elaborate but accurate saliency map formulae are more expensive to compute for the attacker. We would like to emphasize that our method’s high success rate can be further improved by adversaries only interested in crafting a limited number of samples. Indeed, to lower the distortion of one particular sample, an adversary can use adversarial saliency maps to fine-tune the perturbation introduced. On the other hand, if an adversary wants to craft large amounts of adversarial samples, performance is important. In our evaluation, we balanced these factors to craft adversarial samples against the DNN in less than a second. As far as our algorithm implementation was concerned, the most computationally expensive steps were the matrix manipulations required to construct adversarial saliency maps from the forward derivative matrix. The complexity depends on the number of input features. These matrix operations can be made more efficient, notably by making better use of GPU-accelerated computations.
Our efforts so far represent a first but meaningful step towards mitigating adversarial samples: the hardness and adversarial distance metrics lay out bases for defense mechanisms. Although designing such defenses is outside of the scope of this paper, we outline two classes of defenses: (1) adversarial sample detection and (2) improvements of DNN robustness.
Developing techniques for adversarial sample detection is a reactive solution. During our experimental process, we noticed that adversarial samples can for instance be detected by evaluating the regularity of samples. More specifically, in our application example, the sum of the squared difference between each pair of neighboring pixels is always higher for adversarial samples than for benign samples. However, there is no a priori reason to assume that this technique will reliably detect adversarial samples in different settings, so extending this approach is one avenue for future work. Another approach was proposed in , but it is unsuccessful as by stacking the denoising auto-encoder used for detection with the original DNN, the adversary can again produce adversarial samples.
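The regularity measure mentioned above can be sketched in Python as follows. The sketch assumes "neighboring pixels" means horizontally and vertically adjacent pixels in a 2D image stored as a list of rows; the function name and the toy images are hypothetical, and any decision threshold would have to be calibrated on benign data.

```python
def regularity_score(image):
    """Sum of squared differences between horizontally and vertically adjacent pixels."""
    rows, cols = len(image), len(image[0])
    score = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right-hand neighbor
                score += (image[r][c] - image[r][c + 1]) ** 2
            if r + 1 < rows:  # neighbor below
                score += (image[r][c] - image[r + 1][c]) ** 2
    return score


smooth = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
perturbed = [[0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0]]
print(regularity_score(smooth), regularity_score(perturbed))
```

An isolated saturated pixel raises the score sharply, which is the kind of irregularity that pixel-pair perturbations tend to introduce.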
The second class of solutions seeks to improve training in order to increase the robustness of DNNs. Interestingly, the problem of adversarial samples is closely linked to training. Work on generative adversarial networks showed that a two-player game between two DNNs can lead to the generation of new samples from a training set. This can help augment training datasets. Furthermore, adding adversarial samples to the training set can act like a regularizer. We also observed in our experiments that training with adversarial samples makes crafting additional adversarial samples harder. Indeed, by adding 18,000 adversarial samples to the original MNIST training dataset, we trained a new instance of our DNN. We then ran our algorithms again on this newly trained network and crafted a set of 9,000 adversarial samples. Preliminary analysis of these adversarial samples showed that the success rate was reduced by while the average distortion increased by , suggesting that training with adversarial samples can make DNNs more robust.
The security of machine learning is an active research topic within the security and machine learning communities. A broad taxonomy of attacks and required adversarial capabilities is discussed in  and  along with considerations for building defense mechanisms. Biggio et al. studied classifiers in adversarial settings and outlined a framework for securing them. However, their work does not consider DNNs but rather other techniques used for binary classification like logistic regression or Support Vector Machines. Generally speaking, attacks against machine learning can be separated into two categories, depending on whether they are executed during training or at test time.
Prior work on adversarial sample crafting against DNNs derived a simple technique corresponding to the Architecture and Training Tools threat model, based on the backpropagation procedure used during network training [18, 30, 36]. This approach creates adversarial samples by defining an optimization problem based on the DNN’s cost function. In other words, instead of computing gradients to update DNN weights, one computes gradients to update the input, which is then misclassified as the target class by a DNN. The alternative approach proposed in this paper is to identify input regions that are most relevant to its classification by a DNN. This is accomplished by computing the saliency map of a given input, as described by Simonyan et al. in the case of DNNs handling images . We extended this concept to create adversarial saliency maps highlighting regions of the input that need to be perturbed in order to accomplish the adversarial goal.
Previous work by Yosinski et al. investigated how features are transferable between deep neural networks, while Szegedy et al. showed that adversarial samples can indeed be misclassified across models. They report that once an adversarial sample is generated for a given neural network architecture, it is also likely to be misclassified in neural networks designed differently, which explains why the attack is successful. However, the effectiveness of this kind of attack depends on (1) the quality and size of the surrogate dataset collected by the adversary, and (2) the adequateness of the adversarial network used to craft adversarial samples.
Broadly speaking, this paper has explored adversarial behavior in deep learning systems. In addition to exploring the goals and capabilities of DNN adversaries, we introduced a new class of algorithms to craft adversarial samples based on computing forward derivatives. This technique allows an adversary with knowledge of the network architecture to construct adversarial saliency maps that identify features of the input that most significantly impact output classification. These algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample.
Solutions to defend DNNs against adversaries can be divided into two classes: detecting adversarial samples and improving the training phase. The detection of adversarial samples remains an open problem. Interestingly, the universal approximation theorem formulated by Hornik et al. states that one hidden layer is sufficient to represent a function with arbitrary accuracy. Thus, one can intuitively conceive that improving the training phase is key to resisting adversarial samples.
In future work, we plan to address the limitations of DNNs trained in an unsupervised manner as well as cyclical recurrent neural networks (as opposed to the acyclical networks considered throughout this paper). Also, as most models of our taxonomy have yet to be researched, this leaves room for further investigation of DL in various adversarial settings.
The authors would like to warmly thank Dr. Damien Octeau and Aline Papernot for insightful discussions about this work. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
International Journal of Pattern Recognition and Artificial Intelligence, 28(07):1460002, 2014.
A unified architecture for natural language processing: Deep neural networks with task learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
Evading network anomaly detection systems: formal reasoning and practical techniques. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pages 59–68. ACM, 2006.
The MNIST database of handwritten digits, 1998.
Convolutional, long short-term memory, fully connected deep neural networks. 2015.