
Principal Component Properties of Adversarial Samples

by Malhar Jere, et al.

Deep Neural Networks for image classification have been found to be vulnerable to adversarial samples, which consist of sub-perceptual noise added to a benign image that can easily fool trained neural networks, posing a significant risk to their commercial deployment. In this work, we analyze adversarial samples through the lens of their contributions to the principal components of each image, which differs from prior works in which authors performed PCA on the entire dataset. We investigate a number of state-of-the-art deep neural networks trained on ImageNet as well as several attacks for each of the networks. Our results demonstrate empirically that adversarial samples across several attacks have similar properties in their contributions to the principal components of neural network inputs. We propose a new metric for neural networks to measure their robustness to adversarial samples, termed the (k, p) point. We utilize this metric to achieve 93.36% accuracy in detecting adversarial samples independent of architecture and attack type for models trained on ImageNet.




Introduction

Artificial Neural Networks have made a resurgence in recent times and have achieved state-of-the-art results on numerous tasks such as image classification [14]. As their popularity rises, the investigation of their security becomes ever more relevant. Adversarial examples in particular - small, tailored changes to an input that cause a neural network to misclassify it - pose a serious threat to the safe deployment of neural networks. Recent works have shown that adversarial samples exploit non-robust features of datasets, and that neural networks trained on adversarial samples can generalize to the test set [6]. Because these non-robust features are invisible to humans, performing inference on lossy reconstructions of the adversarial input has the potential to shed light on the dependence between the adversarial noise and the robust features of the image.

In this work, we seek to analyze adversarial samples in terms of their contribution to the principal components of an image and characterize the vulnerability of these models. We test our method for a number of different Deep Neural Network architectures, datasets and attack types, and identify a general trend about adversarial samples.

Background and Prior Work

Adversarial Samples

We consider a neural network f used for classification, where f(x)_c represents the probability that image x corresponds to class c. Images are represented as x ∈ R^(w×h×ch), where w, h and ch are the width, height and number of channels of the image. We denote the classification of the network as C(x) = argmax_c f(x)_c, with C*(x) representing the true class, or the ground truth, of the image. Given an image x and an image classifier f, an adversarial sample x' follows two properties:

  • D(x, x') is small for some distance metric D, implying that the images x and x' appear visually similar to humans.

  • C(x') ≠ C*(x) = C(x). This means that the prediction on the adversarial sample is incorrect whereas the original prediction is correct.
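These two properties translate directly into a check on a candidate perturbation. A minimal NumPy sketch, using a hypothetical toy linear classifier (the weights W and the predict and is_adversarial helpers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy linear classifier over flattened 32x32x3 images; the
# weight matrix W stands in for a trained network.
W = rng.normal(size=(10, 32 * 32 * 3))

def predict(x):
    """Return the predicted class C(x) = argmax_c f(x)_c for a flattened image x."""
    return int(np.argmax(W @ x))

def is_adversarial(x, x_adv, true_class, eps=1.0):
    """Check both properties: D(x, x') is small, and the prediction on x'
    is incorrect while the prediction on x is correct."""
    small = np.linalg.norm(x - x_adv) <= eps
    fooled = predict(x) == true_class and predict(x_adv) != true_class
    return bool(small and fooled)

x = rng.uniform(0.0, 1.0, size=32 * 32 * 3)
x_adv = x + rng.normal(scale=1e-3, size=x.shape)  # a tiny candidate perturbation
print(is_adversarial(x, x_adv, predict(x)))
```

A tiny random perturbation rarely satisfies the second property; real attacks such as the three below search for the perturbation deliberately.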

In this work, we focus on three methods to generate adversarial samples.

DeepFool.

DeepFool [9] is an iterative untargeted attack that perturbs an input toward the nearest decision boundary of the network while minimizing the ℓ2 distance between the altered (adversarial) example and the original image.

Jacobian Saliency Map Attack:

Papernot et al. introduced the Jacobian-based Saliency Map Attack [11], a targeted attack optimized under the ℓ0 distance. The attack is a greedy algorithm that utilizes the saliency map of the neural network to pick pixels to modify one at a time, increasing the target classification probability on each iteration.

Carlini Wagner Attack.

For a given image x, the goal of the Carlini-Wagner attack [2] is to find a small perturbation δ such that the model misclassifies the input as a chosen adversarial class t. The attack can be formulated as the following optimization problem: minimize ||δ||_p such that C(x + δ) = t and x + δ ∈ [0, 1]^n, where ||·||_p is the p-norm. In this paper we use the ℓ2 norm, i.e. p = 2.
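The formulation above can be illustrated with plain gradient descent on a toy differentiable model. The following NumPy sketch minimizes a CW-style hinge objective for a hypothetical linear model (the weights W, the constant c, the step size and the iteration count are all illustrative assumptions, not the authors' setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy differentiable model: logits z = W @ x.
W = rng.normal(size=(10, 64))
x = rng.uniform(0.0, 1.0, size=64)
target = 3          # chosen adversarial class t
c = 1.0             # trade-off between perturbation size and misclassification

def cw_loss(delta):
    """||delta||_2^2 + c * max(max_{i != t} z_i - z_t, 0): a CW-style objective."""
    z = W @ (x + delta)
    z_masked = z.copy()
    z_masked[target] = -np.inf          # exclude the target from the max
    return float(delta @ delta + c * max(z_masked.max() - z[target], 0.0))

def cw_grad(delta):
    """Gradient of the objective above with respect to delta."""
    z = W @ (x + delta)
    z_masked = z.copy()
    z_masked[target] = -np.inf
    i_star = int(np.argmax(z_masked))   # strongest competing class
    g = 2.0 * delta
    if z[i_star] - z[target] > 0:       # hinge term is active
        g = g + c * (W[i_star] - W[target])
    return g

# Plain gradient descent on delta; a real attack would also keep x + delta
# inside the valid input box [0, 1]^n.
delta = np.zeros_like(x)
for _ in range(200):
    delta -= 0.01 * cw_grad(delta)
print(cw_loss(delta))
```

The actual attack uses a change of variables and binary search over c; this sketch only shows the shape of the objective being minimized.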

Prior Work

There have been several prior works on detecting adversarial samples. DeepFense [13] formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a neural network. [10] minimize a reverse cross-entropy, which encourages deep networks to learn latent representations that better distinguish adversarial examples from normal ones. [7] identify exploitation channels and utilize them for adversarial sample detection.

Our work is most similar to [8], which characterizes adversarial samples in terms of the Local Intrinsic Dimensionality, and to [2] and [1], which show PCA to be an effective defense against certain adversarial attacks on smaller datasets such as MNIST. Our method, however, differs in that we seek to understand adversarial samples based on their contributions to the principal components of a single image, performing PCA over the rows of each image, thereby allowing us to scale our technique to much larger datasets such as ImageNet.


Threat Model

There are two different settings for adversarial attacks. The most general setting is the black box threat model, where adversaries do not have access to any information about the neural network (e.g. gradients) except for the predictions of the network. In the white box threat model, all information about the neural network is accessible, including its weights, architecture, gradients and training method. In this work we consider situations where adversaries have white-box access to the neural network.

Defensive PCA

[2] and [1] have shown PCA to be an effective defense against certain adversarial attacks on smaller datasets, performing PCA on the N × d data matrix, where N is the number of samples in the dataset and d the number of features of each sample. This works well when the dataset and the number of features are small; however, for larger datasets with larger inputs this method becomes computationally inefficient, as the size of the data matrix grows with both the dataset size and the input dimensionality.

To tackle this problem we suggest an alternative way to perform PCA, over a single image: we treat the image x as an h × (w·ch) matrix, where h is the number of rows and w·ch is the product of the number of columns and the number of channels of the image. In doing so we can capture the correlations between the pixels of an image and vastly reduce the number of dimensions required for PCA. Additionally, this method is independent of the dataset size. Furthermore, our method has the added advantage that it requires no knowledge of the dataset, which makes it more versatile.

We term this new method of performing PCA rowPCA, denoted PCA_row(x) for an input image x, which treats each row of x as an observation and thus yields one principal component per row. As an example, an ImageNet input image with h rows will generate h principal components. We can then reconstruct our image x from the principal components, with smaller components contributing smaller variance to the image. We denote the first k principal components as P_k(x), and the image reconstruction operation as R(·). The reconstructed image generated from the first k row principal components is thus x̂_k = R(P_k(x)). Figure 1 shows several reconstructed inputs for a benign sample.
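One standard way to realize this row-wise PCA and reconstruction is via the SVD. The following NumPy sketch is a minimal illustration of the idea, not the authors' exact implementation:

```python
import numpy as np

def row_pca_reconstruct(image, k):
    """Reconstruct an (h, w, c) image from its first k row principal components.

    The image is reshaped to an h x (w*c) matrix whose rows are the
    observations, then projected onto the top-k principal axes from the SVD.
    """
    h, w, c = image.shape
    X = image.reshape(h, w * c).astype(np.float64)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Vk = Vt[:k]                               # top-k axes, sorted by singular value
    Xk = ((X - mean) @ Vk.T) @ Vk + mean      # project and reconstruct
    return Xk.reshape(h, w, c)

img = np.random.default_rng(0).uniform(0.0, 1.0, size=(224, 224, 3))
full = row_pca_reconstruct(img, 224)          # all 224 components: near-exact
coarse = row_pca_reconstruct(img, 10)         # first 10 components: lossy
print(float(np.abs(full - img).max()))        # tiny floating-point error
```

Using all h components recovers the image up to floating-point error, while small k gives the lossy reconstructions shown in Figure 1.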

Figure 1: Examples of PCA reconstructed images for a randomly chosen image from the ImageNet validation dataset.

Detecting Dominant Classes

We define the dominant class as the predicted class on the full image x, i.e. C(x). The (k, p) point is defined as a tuple consisting of the component number k at which the dominant class starts becoming the top prediction and the softmax probability p of the dominant class at that particular component number. Algorithm 1 outlines the procedure to obtain the (k, p) point for a particular input, and Figure 2 demonstrates the functionality of our detection method on an adversarial sample. The steps that occur are:

  • The input image is decomposed into its row-wise principal components.

  • Each of the sets P_k(x) of descending principal components (sorted by eigenvalue), for k = 1, …, h, is used to reconstruct the image.

  • Each reconstructed image is fed through the neural network and the predictions are observed.

  • The (k, p) point is found for the particular set of predictions for each image and is subsequently used to determine whether that particular sample is adversarial or benign.

Figure 2: Visualization of defensive PCA applied to an adversarial input. For an input image, we reconstruct the image from each set of leading principal components and perform inference on each reconstruction to determine the component number at which the dominant class starts becoming the top prediction. The dominant class could be the adversarial class for adversarial inputs, or the ground truth or a misclassified class for benign inputs.
Result: (k, p) point
begin:
    k ← n;
    while C(R(P_k(x))) = c do
        k ← k − 1;
    return (k + 1, f(R(P_{k+1}(x)))_c);
Algorithm 1: Finding the (k, p) point for a given neural network f, input image x, top scoring class c on the input image, and maximum number of principal components n. We reconstruct the image from components n through 1 and find the point at which the dominant class is no longer dominant.
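The procedure can be sketched in Python. The classify function below is a hypothetical stand-in for a trained network's softmax output, and this sketch scans k upward to the first component at which the dominant class becomes the top prediction, which agrees with the downward scan when the prediction changes only once:

```python
import numpy as np

def leading_reconstructions(image):
    """Yield (k, reconstruction from the first k row principal components)."""
    h, w, c = image.shape
    X = image.reshape(h, w * c).astype(np.float64)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    for k in range(1, h + 1):
        Vk = Vt[:k]
        yield k, (((X - mean) @ Vk.T) @ Vk + mean).reshape(h, w, c)

def kp_point(image, classify, dominant_class):
    """Scan k = 1..h and return (k, p): the first component number at which
    the dominant class is the top prediction, plus its probability there."""
    for k, recon in leading_reconstructions(image):
        probs = classify(recon)
        if int(np.argmax(probs)) == dominant_class:
            return k, float(probs[dominant_class])
    return None

# Hypothetical stand-in for a network's softmax: one "logit" per channel.
def classify(img):
    z = img.mean(axis=(0, 1))
    e = np.exp(z - z.max())
    return e / e.sum()

img = np.random.default_rng(2).uniform(0.0, 1.0, size=(32, 32, 3))
dominant = int(np.argmax(classify(img)))      # predicted class on the full image
print(kp_point(img, classify, dominant))
```

With a real model, classify would wrap a forward pass through the network on the reconstructed image.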


Experimental Setup

Datasets and Models: We evaluated our method on neural networks pre-trained on the ImageNet dataset [3] in PyTorch, namely Inception-v3 [16], ResNet-50 [5], and VGG19 [15].

Attack methods: For each of the models we evaluated our method on the DeepFool [9], Jacobian Saliency Map Attack (JSMA) [11] and Carlini-Wagner L2 attack [2] using the Foolbox library [12]. For each of the 9 (attack, model) pairs, we generated 100 adversarial images.


Behavior of adversarial samples.

Figures 3(a), 3(b) and 3(c) show the clustering of adversarial samples in similar regions of the (k, p) space, while Figure 3(d) shows the clustering of benign samples in similar regions of the space. Figure 4 shows the (k, p) points of all the adversarial and benign samples, demonstrating their separability.

(a) (k, p) points for Carlini-Wagner adversarial samples
(b) (k, p) points for DeepFool adversarial samples
(c) (k, p) points for JSMA adversarial samples
(d) (k, p) points for benign samples
Figure 3: (k, p) points for adversarial and benign samples for ImageNet-trained models.
Figure 4: (k, p) points for all benign and adversarial ImageNet samples across all models and adversarial attacks.

Detection of adversarial samples

We train binary classifiers on the (k, p) points for a fixed (attack, model) pair and evaluate them against (k, p) points from the same pair as well as against adversarial samples derived using other attacks targeted at different architectures.

  • Intra-model detection rate: Given an (attack, model) pair, this metric measures the probability of correctly predicting whether a given (k, p) point comes from a benign or an adversarial sample. We gathered 128 correctly-predicted benign samples and 100 adversarial points for each (attack, model) pair for the ImageNet dataset and used an AdaBoost [4] classifier with 200 weak Decision Tree estimators to distinguish between the two. We achieve an average prediction rate of 94.81%; that is, we can correctly predict whether a sample for a given neural network is adversarial or benign 94.81% of the time.

  • Inter-model detection rate: We observe the distributions of adversarial samples across all attack types and models in order to determine their similarity. To measure this, we train a classifier on one (attack, model) pair and evaluate it on benign and adversarial samples from every other (attack, model) pair, as demonstrated in Figure 5. We achieve an average adversarial detection rate of 93.36% across all architectures and adversarial methods.
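The detection step above reduces to binary classification over two-dimensional (k, p) tuples. A minimal scikit-learn sketch on synthetic points (the cluster locations are illustrative stand-ins, not the paper's measured points):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Synthetic (k, p) points standing in for the benign and adversarial
# clusters observed in Figures 3 and 4; the locations are illustrative.
benign = np.column_stack([rng.normal(10, 2, 128), rng.normal(0.8, 0.05, 128)])
advers = np.column_stack([rng.normal(60, 5, 100), rng.normal(0.4, 0.05, 100)])
X = np.vstack([benign, advers])
y = np.array([0] * len(benign) + [1] * len(advers))  # 0 = benign, 1 = adversarial

# AdaBoost with 200 weak learners; scikit-learn's default weak learner is a
# depth-1 decision tree, mirroring the setup described above.
clf = AdaBoostClassifier(n_estimators=200).fit(X, y)
print(clf.score(X, y))
```

In the paper's setting, X would hold the measured (k, p) points from one (attack, model) pair, and the trained classifier would then be scored on points from other pairs.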

Figure 5: Inter-model and intra-model adversarial sample detection. We achieve near-perfect prediction rates for simple discriminative models trained to identify adversarial samples from one (attack, model) pair and evaluated on a different one. One axis represents the (attack, model) pair our classifier was trained to identify, and the other the (attack, model) pair it was evaluated on.


Discussion

PCA is one of numerous linear methods for dimensionality reduction of neural network inputs. Other techniques such as Sparse Dictionary Learning and Locally Linear Embedding are potential alternatives to PCA, which we intend to explore in future work. One particular limitation of our method is that it needs inputs with many rows, which makes our defense inapplicable to the smaller inputs of smaller neural networks.


Conclusion

We propose a new metric, the (k, p) point, to analyze adversarial samples in terms of their contributions to the principal components of an image. We demonstrate empirically that the (k, p) points of benign and adversarial samples are distinguishable across adversarial attacks and neural network architectures, and are an underlying property of the dataset itself. We train a binary classifier to detect adversarial samples and achieve a 93.36% detection success rate.


References

  • [1] A. N. Bhagoji, D. Cullina, and P. Mittal (2017) Dimensionality reduction as a defense against evasion attacks on machine learning classifiers.
  • [2] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [4] Y. Freund and R. Schapire (1999) A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14 (5), pp. 771–780.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [6] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
  • [7] S. Ma, Y. Liu, G. Tao, W. Lee, and X. Zhang (2019) NIC: detecting adversarial samples with neural network invariant checking. In Proceedings of the Network and Distributed System Security Symposium (NDSS).
  • [8] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.
  • [9] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
  • [10] T. Pang, C. Du, Y. Dong, and J. Zhu (2018) Towards robust detection of adversarial examples. In Advances in Neural Information Processing Systems, pp. 4579–4589.
  • [11] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
  • [12] J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
  • [13] B. D. Rouhani, M. Samragh, M. Javaheripi, T. Javidi, and F. Koushanfar (2018) DeepFense: online accelerated defense against adversarial deep learning. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8.
  • [14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [15] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.