Artificial neural networks have seen a resurgence in recent years and have achieved state-of-the-art results on numerous tasks such as image classification. As their popularity rises, the investigation of their security becomes ever more relevant. Adversarial examples in particular, which involve small, tailored changes to the input that make a neural network misclassify it, pose a serious threat to the safe deployment of neural networks. Recent work has shown that adversarial samples exploit non-robust features of datasets, and that neural networks trained on adversarial samples can generalize to the test set. Because these non-robust features are imperceptible to humans, performing inference on lossy reconstructions of the adversarial input has the potential to shed light on the dependence between the adversarial noise and the robust features of the image.
In this work, we seek to analyze adversarial samples in terms of their contribution to the principal components of an image and characterize the vulnerability of these models. We test our method for a number of different Deep Neural Network architectures, datasets and attack types, and identify a general trend about adversarial samples.
Background and Prior Work
We consider a neural network $f$ used for classification, where $f(x)_c$ represents the probability that image $x$ corresponds to class $c$. Images are represented as $x \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$ and $C$ are the width, height and number of channels of the image. We denote the classification of the network as $\hat{c}(x) = \arg\max_c f(x)_c$, with $c^*(x)$ representing the true class, or the ground truth, of the image. Given an image $x$ and an image classifier, an adversarial sample $x'$ satisfies two properties:
1. $D(x, x')$ is small for some distance metric $D$, implying that the images $x$ and $x'$ appear visually similar to humans.
2. $\hat{c}(x') \neq c^*(x)$ while $\hat{c}(x) = c^*(x)$. This means that the prediction on the adversarial sample is incorrect whereas the original prediction is correct.
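The two defining properties can be made concrete with a small helper. The sketch below is illustrative only: `model` stands in for any callable returning class probabilities, and the choice of the L2 norm and the threshold `eps` are our assumptions, not fixed by the definition.

```python
import numpy as np

def is_adversarial(model, x, x_adv, y_true, eps):
    """Check the two defining properties of an adversarial sample:
    (1) the perturbation is small under the chosen distance metric
        (here the L2 norm, an illustrative choice), and
    (2) the clean input is classified correctly while the perturbed
        input is misclassified."""
    small = np.linalg.norm((x_adv - x).ravel(), ord=2) <= eps
    clean_correct = int(np.argmax(model(x))) == y_true
    fooled = int(np.argmax(model(x_adv))) != y_true
    return small and clean_correct and fooled
```

Note that both conditions must hold: a misclassified clean input does not yield an adversarial sample, no matter how the perturbed input is classified.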
In this work, we focus on three methods for generating adversarial samples.
DeepFool is an iterative, untargeted attack that pushes the input across the nearest decision boundary of the neural network while minimizing the distance metric between the altered (adversarial) example and the original image.
Jacobian Saliency Map Attack:
Papernot et al. introduced the Jacobian-based Saliency Map Attack (JSMA), a targeted attack optimized under the $L_0$ distance. The attack is a greedy algorithm that utilizes the saliency map of the neural network to pick pixels to modify one at a time, increasing the probability of the target class on each iteration.
Carlini-Wagner Attack.
For a given image $x$, the goal of the Carlini-Wagner attack is to find a small perturbation $\delta$ such that the model misclassifies the input as a chosen adversarial class $t$. The attack can be formulated as the following optimization problem: minimize $\|\delta\|_p$ subject to the classifier predicting the target class $t$ on $x + \delta$, with $x + \delta$ remaining a valid image, where $\|\cdot\|_p$ is the p-norm. In this paper we use the $L_2$ norm, i.e. $\|\delta\|_2 = \left(\sum_i \delta_i^2\right)^{1/2}$.
There have been several prior works on detecting adversarial samples. DeepFense formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a neural network. Pang et al. seek to minimize the reverse cross-entropy, which encourages deep networks to learn latent representations that better distinguish adversarial examples from normal ones. NIC identifies exploitation channels and utilizes them for adversarial sample detection.
Our work is most similar to the approach that characterizes adversarial samples in terms of their Local Intrinsic Dimensionality, and to prior defensive PCA works, which show PCA to be an effective defense against certain adversarial attacks on smaller datasets such as MNIST. Our method, however, differs in that we seek to understand adversarial samples based on their contributions to the principal components of a single image, and in that we use the rows of the image as principal components, thereby allowing us to scale our technique to much larger datasets such as ImageNet.
There are two settings for adversarial attacks. The most general is the black-box threat model, where adversaries have no access to any information about the neural network (e.g., its gradients) except for the predictions of the network. In the white-box threat model, all information about the neural network is accessible, including its weights, architecture, gradients and training method. In this work we consider situations where adversaries have white-box access to the neural network.
Prior works have shown PCA to be an effective defense against certain adversarial attacks on smaller datasets, performing PCA on the $n \times d$ data matrix, where $n$ is the number of samples in the dataset and $d$ is the number of features of each sample. This works well when the dataset and the number of features are small; however, for larger datasets with larger inputs this method becomes computationally inefficient, as the data matrix grows with the dataset size and the cost of computing the principal components grows with both the dataset size and the input dimensionality.
To tackle this problem we suggest an alternative way to perform PCA on a single image, treating the image as an $H \times (W \cdot C)$ matrix, where $H$ is the number of rows and $W \cdot C$ is the product of the number of columns and the number of channels of the image. In doing so we can capture the correlations between the pixels of an image and vastly reduce the number of dimensions required for PCA. Additionally, this method is independent of the dataset size. Furthermore, it requires no knowledge of the dataset, which makes it more versatile.
We term this new method of performing PCA rowPCA, denoted $\mathcal{P}_r(x)$ for an input image $x$, which treats each row of $x$ as a principal axis. As an example, an ImageNet input image with dimensionality $H \times W \times C$ will generate $H$ principal components. We can then reconstruct our image from the principal components, with smaller components contributing less variance to the image. We denote the first $k$ principal components as $P_{1..k}$, and the image reconstruction operation as $R$. The reconstructed image generated from the first $k$ row principal components is thus $\hat{x}_k = R(P_{1..k})$. Figure 1 shows several reconstructed inputs for a benign sample.
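The rowPCA decomposition and reconstruction can be sketched in NumPy as follows. This is a minimal illustration of the idea, not the paper's exact implementation, and the function names are ours:

```python
import numpy as np

def row_pca(x):
    """Flatten an H x W x C image into an H x (W*C) matrix, treating each
    row as a sample; return the row mean and the SVD factors, whose right
    singular vectors are the principal axes sorted by decreasing
    singular value."""
    X = x.reshape(x.shape[0], -1)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U, S, Vt

def reconstruct(mean, U, S, Vt, k, shape):
    """Rebuild the image from the first k row principal components."""
    Xk = (U[:, :k] * S[:k]) @ Vt[:k] + mean
    return Xk.reshape(shape)
```

Reconstruction error shrinks as $k$ grows, and using all $H$ components recovers the original image exactly, which is why truncating the smaller components acts as a lossy filter.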
Detecting Dominant Classes
We define the dominant class as the predicted class on the full image. The point is defined as a tuple consisting of the component number at which the dominant class starts becoming the top prediction and the softmax probability of the dominant class at that particular component number. Algorithm 1 outlines the procedure to obtain the point for a particular input, and Figure 2 demonstrates the functionality of our detection method on an adversarial sample. The steps are as follows:
The input image is decomposed into its principal components by the rows.
Each of the sets of leading principal components (sorted in descending order by eigenvalue) is used to reconstruct the image.
Each reconstructed image is fed through the neural network and the predictions are observed.
The point is found from the set of predictions over the reconstructed images and is subsequently used to determine whether that particular sample is adversarial or benign.
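The steps above can be sketched roughly as follows. This reflects our reading of the procedure, under the assumptions that `model` is any callable returning softmax probabilities and that the point is taken at the first component count for which the dominant class becomes the top prediction:

```python
import numpy as np

def find_point(x, model):
    """Return (k, p): the smallest number of row principal components k
    at which the reconstructed image is assigned the dominant class
    (the prediction on the full image), together with the softmax
    probability p of the dominant class at that k."""
    H = x.shape[0]
    X = x.reshape(H, -1)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    dominant = int(np.argmax(model(x)))
    for k in range(1, len(S) + 1):
        # Reconstruct from the k leading components and re-classify.
        xk = ((U[:, :k] * S[:k]) @ Vt[:k] + mean).reshape(x.shape)
        probs = model(xk)
        if int(np.argmax(probs)) == dominant:
            return k, float(probs[dominant])
    # Using all components recovers x exactly, so this is a safe fallback.
    return len(S), float(model(x)[dominant])
```

The resulting (k, p) tuples are the features on which the binary classifiers below operate.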
Datasets and Models: We evaluated our method on neural networks pre-trained on the ImageNet dataset in PyTorch, namely Inception-v3, ResNet-50 and VGG19.
Behavior of adversarial samples.
Figures 3(a), 3(b) and 3(c) show the clustering of adversarial samples in similar regions of the space, while Figure 3(d) shows the clustering of benign samples in similar regions of the space. Figure 4 shows the points of all the adversarial and benign samples, demonstrating their separability.
Detection of adversarial samples
We train binary classifiers on the points for a fixed (attack, model) pair and evaluate them against points from the same pair as well as against adversarial samples derived from other attacks targeted at different architectures.
Intra-model detection rate: Given an (attack, model) pair, this metric measures the probability of correctly predicting whether a given point comes from a benign or an adversarial sample. We gathered 128 correctly-predicted benign samples and 100 adversarial points for each (attack, model) pair on the ImageNet dataset and used an AdaBoost classifier with 200 weak decision-tree estimators to distinguish between the two. We achieve an average prediction rate of 94.81%; that is, we can correctly predict whether a sample for a given neural network is adversarial or benign 94.81% of the time.
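Such a classifier can be set up with scikit-learn, whose AdaBoost implementation uses depth-1 decision trees (stumps) as its default weak learner. The (k, p) points below are synthetic stand-ins we made up for illustration; in the paper they come from the detection procedure above:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-ins for (component index, softmax probability) points.
# The assumption (ours, for illustration) is that benign samples lock in
# the dominant class earlier and with higher confidence.
rng = np.random.default_rng(0)
benign = np.column_stack([rng.uniform(1, 40, 128), rng.uniform(0.6, 1.0, 128)])
advers = np.column_stack([rng.uniform(60, 224, 100), rng.uniform(0.2, 0.7, 100)])
X = np.vstack([benign, advers])
y = np.concatenate([np.zeros(128), np.ones(100)])  # 0 = benign, 1 = adversarial

# 200 weak decision-tree estimators, as in the setup described above.
clf = AdaBoostClassifier(n_estimators=200)
clf.fit(X, y)
```

With well-separated points like these the training accuracy is essentially perfect; the reported detection rates come from the real, overlapping distributions.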
Inter-model detection rate: We observe the distributions of adversarial samples across all attack types and models in order to determine their similarity. To measure this, we train classifiers on one (attack, model) pair and evaluate them on benign and adversarial samples for every other (attack, model) pair, as demonstrated in Figure 5. We achieve an average adversarial detection rate of 93.36% across all architectures and adversarial methods.
PCA is one of numerous linear methods for dimensionality reduction of neural network inputs. Other techniques such as Sparse Dictionary Learning and Local Linear Embedding are potential alternatives to PCA, which we intend to explore in future work. One particular limitation of our method is that it needs inputs with many rows, which makes our defense inapplicable to the smaller inputs used by smaller neural networks.
We identify a new metric, the point, to analyze adversarial samples in terms of their contributions to the principal components of an image. We demonstrate empirically that the points of benign and adversarial samples are distinguishable across adversarial attacks and neural network architectures and are an underlying property of the dataset itself. We train a binary classifier to detect adversarial samples and achieve a 93.36% detection success rate.
- Bhagoji et al. (2017) Dimensionality reduction as a defense against evasion attacks on machine learning classifiers. arXiv preprint.
- Carlini and Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
- Deng et al. (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Freund and Schapire (1999) A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14 (771-780), pp. 1612.
- He et al. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Ilyas et al. (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
- Ma et al. (2019) NIC: detecting adversarial samples with neural network invariant checking. In Proceedings of the Network and Distributed System Security Symposium (NDSS).
- Ma et al. (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.
- Moosavi-Dezfooli et al. (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
- Pang et al. (2018) Towards robust detection of adversarial examples. In Advances in Neural Information Processing Systems, pp. 4579–4589.
- Papernot et al. (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
- Rauber et al. (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
- Rouhani et al. (2018) DeepFense: online accelerated defense against adversarial deep learning. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8.
- Russakovsky et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
- Simonyan and Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Szegedy et al. (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.