Despite the successes of deep learning in a wide range of computer vision tasks such as image classification [13, 15], object detection [7, 22], and semantic segmentation, it is well known that neural networks are not robust. In particular, it has been shown that adding small but carefully crafted, human-imperceptible deviations to the input, called adversarial perturbations, can cause a neural network to make incorrect predictions with high confidence [4, 9, 16, 27]. The existence of adversarial examples poses a severe security threat to the practical deployment of deep learning models, particularly in safety-critical systems such as autonomous driving and healthcare.
Since the seminal work on adversarial examples, there has been extensive research on adversarial perturbations. This research can be broadly classified into two categories: (i) strategies to defend against adversarial inputs [9, 29, 18]; and (ii) new attack methods that are stronger and can break the proposed defense mechanisms [18, 20].
Most of the defense strategies proposed in the literature target a specific adversary, and as such they are easily broken by stronger adversaries [30, 2]. Madry et al. [18] proposed a robust optimization technique that overcomes this problem by finding the worst-case adversarial examples at each training step and adding them to the training data. While the resulting models show strong empirical evidence of robustness against many attacks, we cannot yet guarantee that a different adversary cannot find inputs that cause the model to predict incorrectly. In fact, it has been observed that Projected Gradient Descent (PGD), the technique at the core of Madry et al.'s method, does not always find the worst-case attack. This finding motivates the search for more effective strategies for improving robustness, beyond relying on the adversarial training framework alone.
We list the main contributions of our work below:
- We design a model that progressively generalizes a linear classifier, enforced via our regularisation method, Linearity Constraint Regularization.
- Our trained model achieves state-of-the-art adversarial accuracy on the MNIST, CIFAR10 and SVHN datasets compared to other adversarial training approaches that aim to improve robustness against the PGD attack.
- We also propose a two-step iterative technique for generating adversarial images by leveraging Inverse Representation Learning and the linearity of our adversarially trained model. We show that the proposed adversarial image generation method generalizes well and can also act as an adversary to other deep neural network classifiers.
2 Background and Related Work
In this section, we briefly discuss the formulation of the adversarial attack problem in the classification setting and then describe the Projected Gradient Descent (PGD) adversarial attack [18], which we use as a baseline in our paper.
The goal of an adversarial attack is to find the minimum perturbation in the input space (i.e., the input pixels of an image) that changes the class prediction of the network. Mathematically, we can write this as follows:
Here, the two classifier outputs being compared are the components corresponding to the true class and to any other class, respectively. The remaining symbols denote the parameters of the classifier, its input and output, and the objective function.
The magnitude of the adversarial perturbation is constrained by a norm bound to ensure that the adversarially perturbed example stays close to the original sample. Below, we pose this as an optimization problem:
Here, the iterate denotes the perturbed sample at the i-th iteration.
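As an illustration of the iterative procedure described above, the following is a minimal NumPy sketch of an L-infinity PGD attack on a toy linear softmax classifier. The classifier, the helper names (`softmax`, `pgd_attack`) and the values of `eps`, `alpha` and `steps` are assumptions chosen for illustration; they are not the network or settings used in this paper, and the random start that is often used with PGD is omitted for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pgd_attack(x, y, W, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD on a toy linear softmax classifier with logits W @ x + b."""
    x_adv = x.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(steps):
        p = softmax(W @ x_adv + b)
        grad = W.T @ (p - onehot)                  # gradient of cross-entropy w.r.t. the input
        x_adv = x_adv + alpha * np.sign(grad)      # signed ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back onto the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep pixels in the valid [0, 1] range
    return x_adv

# Toy data (hypothetical): 8 "pixels", 3 classes.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)
x = rng.uniform(0.2, 0.8, size=8)
y = int(np.argmax(W @ x + b))            # treat the clean prediction as the true label
x_adv = pgd_attack(x, y, W, b)
```

Each iteration takes a signed gradient step that increases the loss and then projects the result back into the allowed perturbation set, so the final `x_adv` never differs from `x` by more than `eps` in any pixel.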
In general, there are two main types of adversarial attacks: white-box and black-box attacks. In a white-box attack, the adversary has complete knowledge of the network and its parameters, whereas in a black-box attack no information about the network architecture or parameters is available. Numerous defenses against such adversarial attacks have been proposed; [3, 17, 10, 5, 21, 32, 25] present different defense strategies to improve the robustness of deep neural networks. Unfortunately, Athalye et al. [2] find that all these defenses rely on obfuscated gradients, a kind of gradient masking, which leads to a false sense of security against adversarial examples and provides only a limited improvement in robustness. They also observe that adversarial training is the only defense that significantly increases robustness to adversarial examples within the threat model.
Hence, we focus on the adversarial training framework, which defends against adversarial attacks by training neural networks on adversarial images generated on the fly during training. Adversarial training constitutes the current state of the art in adversarial robustness against white-box attacks. In this work, we aim to improve model robustness further by building on the adversarial training framework.
Xie et al. [33] proposed a network architecture (termed FDT) that comprises denoising blocks at intermediate hidden layers, which aim to remove the noise that adversarial perturbations introduce into latent features, in turn increasing the adversarial robustness of the model. Sinha et al. [23] observe that the latent layers of an adversarially trained model are not robust and propose a technique to adversarially train a latent layer (termed LAT) in conjunction with adversarial training of the full network, with the aim of increasing the adversarial robustness of the model. In this paper, we show improvements in both adversarial accuracy and clean accuracy over these methods.
3 Neural Network Architecture and Training Objective Function
In this section, we briefly discuss our proposed network architecture and training objective. As depicted in Fig. 1, our network architecture consists of two separate branches: the Concept branch and the Significance branch. The Concept branch consists of a series of stacked residual convolution blocks from the ResNet18 architecture, followed by global average pooling and two parallel components: a fully connected layer and a series of up-convolution layers. The Significance branch consists of a series of stacked residual convolution blocks from the ResNet18 architecture, followed by global average pooling and a fully connected layer combined with a reshape layer.
The input image is fed to the Concept branch, which outputs the “Concept Vector” from the fully connected layer and a reconstructed image from the series of up-convolution layers. The same input is also fed to the Significance branch to produce the “Class Significance Matrix”, a 2-dimensional matrix. Finally, we multiply the “Concept Vector” with the “Class Significance Matrix” and apply the softmax function to produce the class probabilities of the classifier.
Let us assume that the Concept branch of our network is defined by a sequence of transformations, one for each of its layers. For a given input, we can formulate the operation performed by the Concept branch of our architecture as below:
Here, the “Concept Vector” has one component per concept, where the number of components is the number of different concepts we want to capture from the underlying data distribution.
Similarly, we assume that the Significance branch of the network is defined by a sequence of transformations, one for each of its layers. For the same input, we can formulate the operation performed by the Significance branch of our architecture as below:
Here, the dimensions of the “Class Significance Matrix” are given by the number of classes and the number of concepts.
We define the logit vector as the product of the Concept Vector and the Class Significance Matrix. We formulate the logit vector and the output of the classifier as follows:
We term the proposed network architecture the Linearised Concept Significance Net, denoted LiCS Net. In the next section, we explain the linear aspect of our classifier.
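The way the two branch outputs are combined can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `concept_branch` and `significance_branch` are hypothetical random stand-ins for the two ResNet18-based branches, the Class Significance Matrix is assumed to have one column per class (shape concepts × classes), and the reconstruction output of the Concept branch is left out.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def lics_forward(x, concept_branch, significance_branch):
    """Combine the two branch outputs into class probabilities."""
    c = concept_branch(x)                # Concept Vector, shape (p,)
    G = significance_branch(x)           # Class Significance Matrix, assumed shape (p, K)
    logits = c @ G                       # one logit per class, shape (K,)
    return softmax(logits), c, G

# Hypothetical stand-ins for the two branches (not the paper's networks).
rng = np.random.default_rng(0)
d, p, K = 16, 10, 10                     # input size, number of concepts, number of classes
A = rng.normal(size=(p, d))
B = rng.normal(size=(p * K, d))
concept_branch = lambda x: np.tanh(A @ x)
significance_branch = lambda x: np.tanh(B @ x).reshape(p, K)

probs, c, G = lics_forward(rng.uniform(size=d), concept_branch, significance_branch)
```

Because both `c` and `G` depend on the input, the classifier is linear only in the Concept Vector for a fixed Class Significance Matrix, which is the sense of linearity discussed in the text.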
We also propose a training objective termed the Lipschitz and Linearity Constrained Objective Function, denoted ObjL2C. It comprises four components: (a) the cross-entropy loss for classification, (b) the Linearity Constraint Regularization loss, (c) the Local Lipschitz Regularizer for the Concept Vector and Class Significance Matrix, and (d) the reconstruction error. Our training objective can be formulated as follows:
Here, the coefficients are the regularizer coefficients corresponding to the different components of our objective function.
We briefly discuss the different components of the objective function.
(a) Cross-Entropy loss is defined as follows:
Here, the summation index is the class index and runs over the number of classes. The first factor holds the ground-truth class label score, and the second holds the class probability scores obtained when an adversarial image is fed through the classifier.
For the remainder of this section, the adversarial image refers to the image generated from its corresponding true image by following eq. 2. Note that the ground-truth label is the same for both the true image and its adversarial counterpart.
(b) Linearity Constraint Regularization loss is defined as follows:
Here, the norm symbol denotes the norm of a given vector. We enforce the logit vector to behave as a linear classifier with the “Concept Vector” as the input and the “Class Significance Matrix” as the parameters, or weights, of that linear classifier.
Intuitively, a linear classifier can be defined in terms of an input and a weight matrix that captures the significance of each input feature for the classification task. Comparing eq. 5 with this linear classifier, we hypothesize that the Concept Vector encodes the transformed input features, playing the role of the input, and that the Class Significance Matrix encodes the significance weight factors for the concept features, playing the role of the weights.
(c) The Local Lipschitz Regularizer for the Concept Vector and Class Significance Matrix is defined based on the concept of the “Lipschitz constant”. Let us assume a function with a given Lipschitz constant. The equation below then holds for all pairs of inputs.
Hence, we argue that a lower Lipschitz constant ensures that the function's output for a perturbed input is not significantly different from its output for the original input, which can be used to improve the adversarial robustness of a classifier.
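For reference, the standard Lipschitz condition referred to above can be written as follows. The symbols $f$, $L$, $x_1$, $x_2$ and $\delta$, as well as the unspecified norm, are notation chosen here for illustration and may differ from the notation used elsewhere in the paper.

```latex
\[
  \lVert f(x_1) - f(x_2) \rVert \;\le\; L \,\lVert x_1 - x_2 \rVert
  \qquad \text{for all } x_1, x_2 ,
\]
\[
  \text{so that, with } x_2 = x_1 + \delta: \qquad
  \lVert f(x_1 + \delta) - f(x_1) \rVert \;\le\; L \,\lVert \delta \rVert .
\]
```

The second line makes the robustness argument explicit: for a perturbation of bounded size, the change in the output is bounded by the Lipschitz constant times that size.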
The Lipschitz constant of our network has an upper bound defined by the product of the Lipschitz constants of its sub-networks, the Concept and Significance branches. We can formulate this as follows:
Here, the two factors are the Lipschitz constants of the Concept branch and the Significance branch, respectively.
To achieve a robust classifier, we aim to improve the robustness of its sub-networks using the Local Lipschitz Regularizer for the Concept Vector and Class Significance Matrix, which is formulated as follows:
Here, the first pair of terms are the Concept Vector and Class Significance Matrix obtained when a true image is passed through our network, and the second pair are the Concept Vector and Class Significance Matrix obtained when the corresponding adversarial image is passed through the network. The two weights are the regularization coefficients.
Note that, through the Local Lipschitz Regularizer, we encourage the Concept Vector and Class Significance Matrix to be invariant between an original image and its corresponding adversarial image.
(d) Reconstruction error is defined as follows:
where the two terms are the input image and the reconstructed image at the last up-convolution layer of the Concept branch of our network.
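To make the structure of the objective concrete, here is a hedged NumPy sketch of how the four components could be combined. Everything about the exact form is an assumption: the function names, the use of unsquared L2 (Frobenius) norms, and the coefficient names `alpha`, `beta`, `gamma`, `beta1`, `beta2` are illustrative only, and the Linearity Constraint term is passed in as an already computed number (`l_lin`) because its formula is not restated here.

```python
import numpy as np

def cross_entropy(probs_adv, y):
    """Cross-entropy of the class probabilities for an adversarial image, true label y."""
    return -np.log(probs_adv[y] + 1e-12)

def local_lipschitz_reg(c, c_adv, G, G_adv, beta1=1.0, beta2=1.0):
    """Penalise changes in Concept Vector and Class Significance Matrix (assumed L2 norms)."""
    return beta1 * np.linalg.norm(c - c_adv) + beta2 * np.linalg.norm(G - G_adv)

def reconstruction_error(x, x_rec):
    """Distance between the input image and its reconstruction (assumed L2 norm)."""
    return np.linalg.norm(x - x_rec)

def total_objective(l_ce, l_lin, l_lip, l_rec, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the four components; the coefficient names are hypothetical."""
    return l_ce + alpha * l_lin + beta * l_lip + gamma * l_rec

# Toy values (hypothetical) showing how the pieces fit together.
c, c_adv = np.array([1.0, 2.0]), np.array([1.0, 2.5])
G, G_adv = np.eye(2), np.eye(2)
l_ce = cross_entropy(np.array([0.7, 0.3]), y=0)
l_lip = local_lipschitz_reg(c, c_adv, G, G_adv)
l_rec = reconstruction_error(np.zeros(4), np.full(4, 0.1))
loss = total_objective(l_ce, l_lin=0.0, l_lip=l_lip, l_rec=l_rec)
```

The Lipschitz term is zero exactly when the clean and adversarial inputs produce the same Concept Vector and Class Significance Matrix, which matches the invariance the regularizer is meant to encourage.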
4 Adversarial Image Generation
In this section, we propose an adversarial image generation method that consists of two iterative steps: (i) find the minimum perturbation of the Concept Vector that is guaranteed to fool the classifier, by exploiting the “linearity” aspect of the classifier; (ii) using the perturbed Concept Vector from step (i), generate an equivalent perturbation in image space (i.e., image pixels) that fools the classifier, by leveraging inverse representation.
We elaborate on these steps below:
Step (i): We represent the i-th component of the logit vector defined in eq. 5 as:
Here, the weight vector is the i-th column of the Class Significance Matrix, which encodes the significance weight factors of the concepts for the i-th class.
Let us assume that, for a given image, the network produces a Concept Vector and that a ground-truth class label corresponds to the image. The minimum perturbation of the Concept Vector required to fool the classifier can then be formulated as follows:
In the above relation, the class index must differ from the ground-truth class. We can rearrange and write the above equation as follows:
As shown in Fig. 2, the Concept Vector is represented as a point in the concept space, whose dimension is the number of concepts. We choose to perturb the Concept Vector towards the linear decision boundary of a target class, which satisfies the following relation:
Here, the target class is the class whose decision boundary is nearest to the Concept Vector, in terms of norm, compared to the decision boundary of any other class.
We then shift the Concept Vector in the direction normal to the linear decision boundary surface. In Fig. 2, the movement of the Concept Vector is represented by a red arrow, and the linear decision boundary of the target class is also depicted. Mathematically, we represent the direction of the optimal shift of the Concept Vector as follows:
Hence, we obtain a new perturbed Concept Vector by adding the optimal shift to the current value of the Concept Vector. We represent this as follows:
We perform this whole process iteratively until the perturbed Concept Vector changes the classifier's prediction away from the true class. The final perturbed Concept Vector therefore satisfies the condition that the predicted class differs from the ground-truth class.
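Step (i) can be sketched in NumPy as follows. This is an illustrative reading rather than the exact procedure: the function names, the small `overshoot` factor used to cross the boundary instead of landing exactly on it, and the choice of the L2 norm are assumptions. The Class Significance Matrix `G` is held fixed and assumed to have one column per class.

```python
import numpy as np

def nearest_boundary_shift(c, G, y):
    """Return (target class, shift) that moves c onto the nearest linear decision boundary."""
    logits = c @ G
    best = None
    for j in range(G.shape[1]):
        if j == y:
            continue
        w = G[:, j] - G[:, y]                    # normal of the boundary between classes y and j
        norm = np.linalg.norm(w)
        if norm == 0.0:
            continue
        dist = (logits[y] - logits[j]) / norm    # L2 distance from c to that boundary
        if best is None or dist < best[0]:
            best = (dist, j, w / norm)
    dist, j, direction = best
    return j, dist * direction

def perturb_concept(c, G, y, overshoot=0.02, max_iter=50):
    """Shift c across the nearest boundary until the predicted class is no longer y."""
    c_adv = c.copy()
    for _ in range(max_iter):
        if int(np.argmax(c_adv @ G)) != y:
            break
        _, shift = nearest_boundary_shift(c_adv, G, y)
        c_adv = c_adv + (1.0 + overshoot) * shift
    return c_adv

# Toy example (hypothetical values): 10 concepts, 10 classes.
rng = np.random.default_rng(0)
G = rng.normal(size=(10, 10))
c = rng.normal(size=10)
y = int(np.argmax(c @ G))                # treat the current prediction as the true class
c_adv = perturb_concept(c, G, y)
```

With `G` fixed, the logits are linear in the Concept Vector, so a single shift along the boundary normal with a positive overshoot is enough to make the target-class logit exceed the true-class logit.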
Step (ii): From step (i), we obtain a final perturbed Concept Vector that is guaranteed to change the classifier's prediction. The main question now is: how do we generate the adversarial image that corresponds to the final perturbed Concept Vector?
Following prior work on representation learning, the representation vector of a given input is defined as the activations of the penultimate layer of the network. It has also been shown that an adversarially trained model produces similar representation vectors for semantically similar images. For a given input, the Concept Vector generated by the Concept branch of LiCS Net is analogous to the representation vector.
We use the “Inverse Representation” [19, 31, 6] method to generate the adversarial image that corresponds to the final perturbed Concept Vector, starting from the true input image and its corresponding Concept Vector. We pose this as an optimization problem, detailed in Algorithm 1.
We term the proposed algorithm Adversarial Image Generation using Linearity and Inverse Representation, denoted AIGLiIR.
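The inverse-representation step can be illustrated with a generic sketch; it is not a transcription of Algorithm 1. It assumes a hypothetical linear stand-in `A` for the Concept branch, a squared L2 matching loss, and plain projected gradient descent over pixels in [0, 1]. With a real network, the gradient would be obtained by backpropagation through the Concept branch.

```python
import numpy as np

def invert_representation(x0, target, encoder_matrix, steps=200):
    """Projected gradient descent on 0.5 * ||A x - target||^2 over pixels in [0, 1]."""
    A = encoder_matrix
    lr = 1.0 / np.linalg.norm(A, 2) ** 2       # step size 1/L, L = squared spectral norm of A
    x = x0.copy()
    for _ in range(steps):
        grad = A.T @ (A @ x - target)          # gradient of the matching loss w.r.t. pixels
        x = np.clip(x - lr * grad, 0.0, 1.0)   # descent step, then keep pixels valid
    return x

# Toy example (hypothetical values): 32 "pixels", 10 concepts.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 32))                    # linear stand-in for the Concept branch
x_true = rng.uniform(0.3, 0.7, size=32)          # starting point: the true image
c_adv = A @ x_true + 0.1 * rng.normal(size=10)   # stand-in for the perturbed Concept Vector
x_adv = invert_representation(x_true, c_adv, A)
```

Starting the search from the true image keeps the result close to it in practice, while each step reduces the distance between the image's concept representation and the perturbed target.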
In Fig. 3, we depict a few adversarial images generated by our proposed method. Such adversarial images are guaranteed to fool the white-box threat model. For example, in the case of CIFAR10, adversarial images generated by our method are classified by our proposed LiCS Net as Bird and Deer, whereas their corresponding true labels are Airplane and Horse.
Moreover, based on the results provided in the next section, we claim that our adversarial image generation method is very effective in fooling various deep neural networks designed for classification tasks.
We observe that most deep neural network classifiers consist of a series of feature extraction layers, or convolutional layers, followed by one or more fully connected layers. Hence, we represent such a deep neural network classifier as a composition of f and g. Here, f acts as the transformation applied to the input by the sub-network comprising all layers up to the penultimate layer, and g acts as the significance factor applied by the last fully connected layer to each transformed feature from f.
We observe that f and g are analogous to the Concept Vector and the Class Significance Matrix of LiCS Net, in that the Concept Vector encodes the transformed features of the input image and the Class Significance Matrix captures the significance weight factor for each concept feature. We argue that our classifier LiCS Net is linear in its transformed input features. Hence, we can also apply our adversarial image generation method to other classification networks.
5 Experiments and Results
In this section, we discuss our experimental findings on the robustness of our proposed network LiCS Net. We perform experiments to show the “transferability” of the adversarial images generated by our method AIGLiIR. We also discuss findings on the effectiveness of the Linearity Constraint and the Local Lipschitz Regularizer in our objective function for improving adversarial robustness. The samples used for adversarial training are constructed using PGD adversarial perturbations [18]. For the remainder of this section, an adversarially trained model refers to a model trained using PGD adversarial examples. Clean accuracy and adversarial accuracy are defined as the accuracy of a network over the original images and over the adversarial images generated from the test dataset, respectively. Higher adversarial accuracy implies a more robust network. To test the efficacy of our proposed training method, we perform experiments on the CIFAR10, SVHN and MNIST datasets. For fairness, we compare our method against three fine-tuned techniques applied within the adversarial training framework: Adversarial Training (AT) [18], Feature Denoising Training (FDT) [33] and Latent Adversarial Training (LAT) [23].
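The two metrics defined above differ only in which images are scored, as the following small sketch shows. The `accuracy` helper, the toy threshold classifier and the toy data are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def accuracy(predict, images, labels):
    """Fraction of images whose predicted class matches the ground-truth label."""
    preds = np.array([predict(x) for x in images])
    return float(np.mean(preds == np.asarray(labels)))

# Toy classifier (hypothetical): class 1 if the pixel sum is positive, else class 0.
predict = lambda x: int(x.sum() > 0)

labels = [1, 0, 1, 0]
clean_images = [np.array([1.0]), np.array([-1.0]), np.array([2.0]), np.array([-2.0])]
adv_images = [np.array([-0.5]), np.array([-1.0]), np.array([2.0]), np.array([0.5])]

clean_acc = accuracy(predict, clean_images, labels)   # scored on the original images
adv_acc = accuracy(predict, adv_images, labels)       # scored on the perturbed images
```

In this toy case two of the four perturbed inputs change the prediction, so the adversarial accuracy is lower than the clean accuracy even though the labels are unchanged.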
5.1 Evaluation using Adversarial Accuracy
We train LiCS Net using our training objective and denote this Concept-Significance Adversarial Training (CSAT). To compare CSAT with other fine-tuned adversarial techniques, we use PGD adversarial perturbations.
PGD configuration: The configuration of the PGD adversarial perturbation varies across datasets. For the MNIST dataset, we bound the maximum per-pixel perturbation (on a pixel scale of 0 to 1) and use 40 PGD iterations with a step size of 2/255. For the CIFAR10 and SVHN datasets, we bound the maximum per-pixel perturbation (on a pixel scale of 0 to 255) and use 10 PGD iterations with a step size of 2/255.
Concept-Significance Adversarial Training (CSAT) vs others: Table 1 compares the adversarial accuracy of our proposed CSAT technique with that of the other fine-tuned adversarial training techniques, AT, FDT and LAT, on the different datasets. CSAT outperforms all of them and achieves state-of-the-art adversarial accuracy on MNIST, CIFAR10 and SVHN.
CSAT achieves adversarial accuracies of 54.77%, 60.41% and 98.68% on the CIFAR10, SVHN and MNIST datasets, respectively. Note that for AT, FDT and LAT, the fine-tuning strategies are performed on a Wide ResNet based (WRN 32-10 wide) architecture [34] in the case of the CIFAR10 and SVHN datasets. We use the same network configuration for all datasets. We choose the dimension of the Concept Vector as 10 and the dimension of the Class Significance Matrix as 10 × 10. To train our model, we use the Adam optimizer [14] with a learning rate of 0.0002.
Table 1: | Dataset | Adversarial Training Technique | Adversarial Accuracy | Clean Accuracy |
5.2 Transferability of Generated Adversarial Samples
Training on adversarial samples has been shown to be one of the best methods to improve the robustness of a classifier [18, 29, 23]. We denote by AIGCSPGD the adversarial image generation method that generates adversarial samples using PGD adversarial perturbations on our network LiCS Net trained with our objective function. In the previous section, we proposed our adversarial image generation method AIGLiIR, which generates adversarial samples using our trained network LiCS Net.
We evaluate the adversarial accuracy of different standard pre-trained models, namely GoogleNet [26] (trained on CIFAR10), ResNet34 (trained on SVHN) and a custom Net-3FC (trained on MNIST), on adversarial samples generated using the AIGCSPGD and AIGLiIR methods, and compare it with their respective clean test accuracy. Note that the custom Net-3FC consists of 3 fully connected layers with ReLU activations and a softmax function.
AIGLiIR vs AIGCSPGD: As evident in Fig. 4, for the classifiers GoogleNet, ResNet34 and the custom Net-3FC, we observe a significant drop in accuracy on adversarial samples generated using the AIGCSPGD and AIGLiIR methods, in comparison to their clean accuracy. For example, in the case of ResNet34 on the SVHN dataset, the drop in accuracy relative to clean accuracy on adversarial samples generated using the AIGLiIR method is 18.34% more than the drop on adversarial samples generated using AIGCSPGD. Similarly, in the case of GoogleNet on the CIFAR10 dataset, the drop in accuracy relative to clean accuracy on adversarial samples generated using the AIGLiIR method is 40.82% more than the drop on adversarial samples generated using AIGCSPGD.
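As a worked reading of the comparison above, the gap between the two accuracy drops reduces to a difference between the two adversarial accuracies, because the clean accuracy cancels. The symbols below are notation introduced here for illustration, and interpreting the reported gap as a difference in percentage points is an assumption.

```latex
\[
  \underbrace{\bigl(A_{\text{clean}} - A_{\text{AIGLiIR}}\bigr)}_{\text{drop under AIGLiIR}}
  \;-\;
  \underbrace{\bigl(A_{\text{clean}} - A_{\text{AIGCSPGD}}\bigr)}_{\text{drop under AIGCSPGD}}
  \;=\; A_{\text{AIGCSPGD}} - A_{\text{AIGLiIR}} .
\]
```

Under this reading, a larger gap means that the same pre-trained classifier retains less accuracy on AIGLiIR samples than on AIGCSPGD samples.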
These experimental findings suggest that adversarial samples generated using AIGLiIR act as a better adversary to various network architectures such as GoogleNet, ResNet34 and the custom Net-3FC, which shows the “transferability” of these adversarial samples. Our adversarial image generation method can also be used for black-box attacks on other classification networks, owing to its robustness and transferability.
In Fig. 5, we depict a few adversarial samples generated by our method from the original test set images of the MNIST, CIFAR10 and SVHN datasets. These generated samples fool standard pre-trained networks such as the custom Net-3FC, GoogleNet and ResNet34. However, our proposed model LiCS Net, trained using our proposed objective function as described in eq. 7, correctly predicts the true class of these generated adversarial samples.
5.3 Ablation Study
Architecture: We use our proposed network LiCS Net for the ablation experiments on our objective function as stated in eq. 7. We keep all hyperparameters the same across the ablation experiments. We evaluate the adversarial accuracy of LiCS Net trained with a base objective function, consisting of the cross-entropy and reconstruction losses, and use it as the baseline adversarial accuracy for the ablation study.
Linearity Constraint vs Local Lipschitz Regularizer: As evident from Table 2, adversarial accuracy improves over the baseline both when the base training objective is augmented with the Linearity Constraint Regularizer and when it is augmented with the Local Lipschitz Regularizer. The experimental findings suggest that the Linearity Constraint is a much more effective regularizer than the Local Lipschitz Regularizer for improving model robustness. Note that we achieve state-of-the-art adversarial accuracy when the base objective is augmented with both the Linearity Constraint and the Local Lipschitz Regularizer.
Table 2: | Training Objective | Adversarial Accuracy |
6 Conclusion
We observe that adversarial training is the de facto method for improving model robustness. We propose the model LiCS Net which, trained with our proposed objective function, achieves state-of-the-art adversarial accuracy on the MNIST, CIFAR10 and SVHN datasets along with an improvement in clean accuracy. We also propose an adversarial image generation method, AIGLiIR, that exploits the Linearity Constraint and Inverse Representation learning to construct adversarial examples. We performed several experiments to demonstrate the robustness and transferability of the adversarial samples generated by our method across different networks. We demonstrate that a model trained with the Linearity Constraint Regularizer in the adversarial training framework shows improved adversarial robustness. Through our research, we shed light on the impact of linearity on robustness. We hope that our findings will inspire the discovery of new adversarial defenses and attacks and offer a significant pathway for new developments in adversarial machine learning.
References
- [1] S. Abdelfattah, G. Abdelrahman, and M. Wang. Augmenting the size of EEG datasets using generative adversarial networks. 05 2018.
- [2] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018.
- [3] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. 2018.
- [4] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
- [5] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar. Stochastic activation pruning for robust adversarial defense. CoRR, abs/1803.01442, 2018.
- [6] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry. Learning perceptually-aligned representations via adversarial robustness. CoRR, abs/1906.00945, 2019.
- [7] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
- [8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.
- [9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
- [10] C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
- [11] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
- [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc.
- [16] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
- [17] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. N. R. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. CoRR, abs/1801.02613, 2018.
- [18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv, abs/1706.06083, 2017.
- [19] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. CoRR, abs/1412.0035, 2014.
- [20] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. CoRR, abs/1610.08401, 2016.
- [21] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. Proc. Int. Conf. Machine Learning, 04 2017.
- [22] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
- [23] A. Sinha, M. Singh, N. Kumari, B. Krishnamurthy, H. Machiraju, and V. N. Balasubramanian. Harnessing the vulnerability of latent layers in adversarially trained models. CoRR, abs/1905.05186, 2019.
- [24] C. Sitawarin, A. N. Bhagoji, A. Mosenia, M. Chiang, and P. Mittal. DARTS: Deceiving autonomous cars with toxic signs. CoRR, abs/1802.06430, 2018.
- [25] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. CoRR, abs/1710.10766, 2017.
- [26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
- [28] V. Tjeng and R. Tedrake. Verifying neural networks with mixed integer programming. CoRR, abs/1711.07356, 2017.
- [29] F. Tramèr, A. Kurakin, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel. Ensemble adversarial training: Attacks and defenses. ArXiv, abs/1705.07204, 2017.
- [30] J. Uesato, B. O’Donoghue, A. van den Oord, and P. Kohli. Adversarial risk and the dangers of evaluating against weak attacks. CoRR, abs/1802.05666, 2018.
- [31] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Deep image prior. CoRR, abs/1711.10925, 2017.
- [32] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille. Mitigating adversarial effects through randomization. CoRR, abs/1711.01991, 2017.
- [33] C. Xie, Y. Wu, L. van der Maaten, A. L. Yuille, and K. He. Feature denoising for improving adversarial robustness. CoRR, abs/1812.03411, 2018.
- [34] S. Zagoruyko and N. Komodakis. Wide residual networks. In R. C. Wilson, E. R. Hancock, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.