However, recent advances in adversarial machine learning have hindered large-scale deployment of deep learning models. Szegedy et al. (2013) have shown that carefully crafted examples can be constructed from input images to produce incorrect outputs with high confidence. Furthermore, such inputs can be generated to output a specific target class; such an attack is known as a targeted attack. The adversarial aspect of these attacks is that the changes made to the input are small enough that a human does not detect them. For image classification tasks, this is usually achieved by constrained optimization of the input image pixels under a norm bound that allows only a maximum perturbation limit. Figures 1 and 2 show examples of adversarial images from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al.) datasets from all 10 of their classes.
Several notable properties of adversarial examples have been discovered recently that make the problem worth studying. Most surprisingly, it has been shown that adversarial examples can transfer from one network to another (source to target model) (Szegedy et al., 2013). This means the attacker does not need access to the original model to attack it. A separate model can be trained to generate adversarial samples, which can then be used as adversarial inputs to the original model. These examples get stronger (lead to highly confident incorrect predictions) as the adversary’s knowledge of the target model increases. Real-world examples of adversarial attacks have been explored by Kurakin et al. (2016a). The authors show that adversarial images retain their properties after being printed physically or recaptured using a camera. The latter point poses a particular threat to deep learning applications in autonomous driving, medical imaging, etc.
Current methods in adversarial defense research take one of two approaches: detection and classification. Carlini and Wagner (2017) show that detecting adversarial examples is a non-trivial task and cannot be done efficiently at present. In this work, we aim to accurately classify adversarial samples without compromising accuracy on clean samples. Current methods to classify adversarial examples employ deep learning techniques that can be trained end-to-end. This allows the adversary to consider the defense as a part of the model, which can be attacked in the same way as the original model, nullifying the effect of the defense. Techniques that are not based on deep learning have yielded promising results. The adversarial perturbations are imperceptible at the input level. However, Xie et al. (2019) show that these perturbations grow when passed through a deep network and show up as noise on the feature maps. Motivated by this fact, we employ a denoising autoencoder to detect and remove this noise in the feature maps.
To train the autoencoder, we use perceptual loss functions (feature losses) (Johnson et al., 2016), previously used in super-resolution and style transfer. These loss functions employ an additional pretrained sub-network known as the loss network. Feature maps from various intermediate layers of this network are extracted and compared pixel by pixel for the input and the reconstructed output of the autoencoder. This ensures not only that the final image is clean, but also that it generates clean feature maps while passing through the classification network. The loss sub-network does not have to be trained for the same classification purpose. In our experiments with the MNIST and CIFAR-10 datasets, we use the same loss network, VGG-16 (Simonyan and Zisserman, 2014) pretrained on the ImageNet dataset. We argue that incorporating this additional sub-network makes it very difficult to generate adversarial examples by training it end to end.
The main contributions of the paper are summarized below:
We introduce a novel method to train denoising autoencoders.
We create a defense strategy that cannot be trivially broken by black box attacks by training another network end-to-end.
We achieve 38.5% accuracy on CIFAR-10 and 83.0% accuracy on MNIST against a powerful attack.
This work builds on previous work in the fields of adversarial defenses and machine learning. This section provides an overview of the related literature and the important techniques used in this study.
2.1 Adversarial training
Adversarial training is a defense in which the network is trained on adversarial samples along with normal samples to achieve adversarial robustness (Goodfellow et al., 2014). It is the only defense that is universally accepted and guarantees an improvement in accuracy. As a result, it is used as a strong baseline for many adversarial defenses (Xie et al., 2019). Adversarial training also addresses the trade-off in accuracy between clean and adversarial inputs for small datasets (LeCun et al., 1998; Krizhevsky et al.) by improving accuracy on clean images; however, these results are absent for larger datasets such as ImageNet (Deng et al., 2009). A major challenge posed by adversarial training is that generating adversarial examples is computationally expensive. As a result, the single-shot fast gradient sign method, which is significantly faster, is used to generate adversarial images dynamically during training. This leads to sub-par adversarial training and a heavy dependence on the attack.
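The training scheme described above can be sketched on a toy problem. The following minimal example, assuming a two-feature logistic-regression classifier and a hand-made dataset (both hypothetical, as are the helper names `predict`, `fgsm`, `sgd_step` and `adversarial_training`), generates a single-shot FGSM sample for every clean sample during training and updates the model on both. It is an illustration of the idea only, not the procedure of any cited work.

```python
import math
import random

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def fgsm(w, b, x, y, eps):
    """Single-shot FGSM: shift each feature by eps along the sign of the loss gradient."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def sgd_step(w, b, x, y, lr):
    """One stochastic gradient step on the cross-entropy loss."""
    err = predict(w, b, x) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

def adversarial_training(data, eps=0.1, lr=0.5, epochs=50, seed=0):
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w, b = sgd_step(w, b, x, y, lr)                      # clean sample
            w, b = sgd_step(w, b, fgsm(w, b, x, y, eps), y, lr)  # adversarial sample
    return w, b

# Hypothetical, linearly separable toy data: (features, label).
data = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
w, b = adversarial_training(list(data))
```

Because the adversarial samples are regenerated from the current weights at every step, the cost of the attack is paid once per training sample, which is why a cheap single-step attack is attractive here.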
2.2 Adversarial defenses
There is a popular class of adversarial defenses that rely on minimizing the difference in a chosen metric calculated for clean and adversarial samples. Adversarial logit pairing (Kannan et al., 2018) aims to minimize the mean-squared distance between the logits of the classification network. Xu et al. (2018) reduce the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. They compare the model’s predictions on the original and squeezed versions of an input to detect adversarial samples. The authors of (Liang et al., 2018) assume that the perturbation is a form of noise and use scalar quantization and smoothing spatial filters to denoise the inputs. Comparing the classification results of the original and denoised versions of an input detects the adversary. While these methods achieve great results, they are narrowly targeted and have been broken trivially (Carlini and Wagner, 2017).
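As a rough illustration of the squeeze-and-compare idea described above, the sketch below reduces the bit depth of an input and flags it when the prediction changes by more than a threshold. The two-class `toy_model`, the threshold of 0.5 and the helper names are assumptions made for this example and are not taken from the cited papers.

```python
def squeeze_bit_depth(pixels, bits):
    """Quantize pixels in [0, 1] to 2**bits gray levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def looks_adversarial(pixels, predict_probs, bits=1, threshold=0.5):
    """Flag an input whose prediction shifts strongly after squeezing."""
    original = predict_probs(pixels)
    squeezed = predict_probs(squeeze_bit_depth(pixels, bits))
    return l1_distance(original, squeezed) > threshold

def toy_model(pixels):
    """Hypothetical classifier: probability of class 1 is the mean pixel value."""
    p1 = sum(pixels) / len(pixels)
    return [1.0 - p1, p1]

clean = [0.0, 0.0, 1.0, 0.0]
perturbed = [0.4, 0.4, 1.0, 0.4]  # small shifts that change the toy prediction

clean_flag = looks_adversarial(clean, toy_model)
perturbed_flag = looks_adversarial(perturbed, toy_model)
```

In this toy setting squeezing leaves the clean input unchanged, while it removes the perturbation from the second input and so exposes the change in prediction.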
2.3 Denoising autoencoders
Denoising autoencoders are the area of defense most relevant to this work. Several recent defensive approaches rely on input preprocessing and transformation techniques. In particular, variants of denoising autoencoders have yielded promising results (Liao et al., 2018). These networks mimic the process of adversarial training as a preprocessing step: instead of making the network parameters robust to adversarial inputs, a special network is trained to filter out adversarial noise so that the original classifier can process a clean image. Vanilla autoencoders suffer from the error amplification effect, in which residual adversarial noise is progressively amplified, leading to incorrect output classification. Most relevant to our approach, Liao et al. (2018) use a high-level representation guided denoiser to overcome this problem. The high-level representation loss is a mean-squared loss between the output vectors of clean and adversarial images activated by the target model. They use a U-Net (Ronneberger et al., 2015) architecture for the autoencoder.
2.4 Feature noise
What makes evasion attacks adversarial is that they are quasi-imperceptible to humans. The noise injected into the input sample is small and hard to detect in preprocessing stages. However, Xie et al. (2019) have shown that this noise propagates through the network and appears as adversarial noise on the feature maps of the target model. Figure 3 shows an example of adversarial noise present in feature maps generated using the ResNet-50 network (He et al., 2016). The attacker generally constrains the noise in the input images using either a norm bound or an upper bound on the change per pixel. However, this constraint is missing on the feature maps of the network, resulting in the aforementioned propagation. The detection and removal of this feature noise serves as a strong motivation for adversarial defense research.
2.5 Perceptual and feature losses
The idea of using a pretrained network to optimize multiple loss functions at once for another network is not novel. Perceptual loss functions were introduced by Johnson et al. (2016) for image super-resolution and style transfer. To calculate perceptual losses, a loss network pretrained for image classification is used to identify differences between the content and style of an image (Gatys et al., 2016). For neural style transfer, the pretrained loss network is used to measure differences (mean-squared error) between the feature maps of the content image and the style image. The motivation behind these perceptual losses is to gradually separate the content and the style of an input image using feature maps from a pretrained network. Feature losses encapsulate the same concept as perceptual losses and differ only in the use case. The activations of convolutional layers in a pretrained network are extracted as feature maps of an input image and are compared using standard mean-squared losses. Notably, the parameters of the loss sub-network remain constant while training the autoencoder.
3.1 Threat model
There are four possible threat models in adversarial attacks as described by Carlini and Wagner (2017):
A zero knowledge attacker (black-box attack) that generates adversarial samples on a model and is not aware of any defense in place.
A perfect knowledge attacker (white-box attack) who is aware of the model architecture and parameters and also aware of the parameters and type of defense in place.
A limited knowledge attacker (grey-box attack) who is aware of the neural network architecture and parameters, but unaware of the defense in place.
A variant of the grey-box attack in which the attacker is aware of the defense in place but unaware of the network architecture and parameters.
As a threat model, we consider a realistic grey-box scenario in which the attacker has access to the model weights and architecture but is unaware of the defense in place. This assumption favors the attacker, since it is unlikely that the attacker has access to the model parameters. Under this threat model, we evaluate our defense on 2 popular attacks: a one-shot Fast Gradient Sign Method (FGSM) attack and a more powerful iterative Projected Gradient Descent (PGD) attack.
3.2 Generating adversarial samples
Adversarial examples for both mentioned attacks are precomputed before training the autoencoder. The autoencoder is trained exclusively on adversarial samples as this is empirically shown to be most effective (Zhang et al., 2019). Experiments in Section 4 show that this does not affect accuracy on clean samples.
3.2.1 Fast Gradient Sign Method
Goodfellow et al. (2014) first proposed the FGSM attack as a fast method to turn an input image into an adversarial one. The gradients are computed only once and a single optimization step is performed. The main idea is to change every input pixel in the optimal direction (+ or -) up to a given perturbation limit ($\epsilon$). If $x$ is the original input, then the perturbed image is calculated as follows:

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right)$$

where $J$ is the classification loss, $\theta$ are the model parameters, and $y$ is the true label of $x$.
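As a minimal sketch of the one-step update described above, the following example applies FGSM to a toy logistic-regression classifier. The model, its weights, the 4-pixel "image" and the helper names (`loss_gradient`, `fgsm`) are illustrative assumptions, not the network or code used in this work; pixel values are assumed to lie in [0, 1].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a logistic-regression model p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """One-step FGSM: move every pixel by eps in the direction that
    increases the loss, then clip back to the valid range [0, 1]."""
    grad = loss_gradient(x, y, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

# Hypothetical 4-pixel input with true label 1 and toy model weights.
x = [0.5, 0.5, 0.5, 0.5]
w = [1.0, -2.0, 0.5, 0.0]
x_adv = fgsm(x, y=1, w=w, b=0.0, eps=0.2)

p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
```

Every pixel moves by at most `eps`, and the model's confidence in the true class (`p_adv`) drops below its value on the clean input (`p_clean`). A pixel whose gradient is zero, such as the last one here, is left unchanged.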
3.2.2 Projected Gradient Descent
The second attack we use is a 100-iteration version of projected gradient descent (PGD) (Madry et al., 2017). PGD is an extremely powerful first-order attack as it removes any time-bound constraints on the attacker. PGD is an evolution of the iterative fast gradient sign method (IFGSM). The IFGSM attack is simply an iterative extension of FGSM in which the input pixels are restricted to a maximum perturbation of $\epsilon$ by clamping the input pixel space. The optimization problem is the following:

$$x_{i+1} = \mathrm{clip}_{x,\epsilon}\left(x_i + \alpha \cdot \mathrm{sign}\left(\nabla_x J(\theta, x_i, y)\right)\right)$$

where $J$ is the classification loss with model parameters $\theta$ and true label $y$, $\mathrm{clip}_{x,\epsilon}$ clamps its argument to within $\epsilon$ of the original input $x$, and $\alpha$ is a chosen step-size hyper-parameter that may be larger or smaller than $\epsilon$. The subscript $i$ indexes the iterations, and the adversarial example gets more difficult to defend against as the iterations proceed. PGD starts from a random point near the original input and performs multiple iterations of the IFGSM attack to transform it into an adversarial example.
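The iterative procedure described above can be sketched as follows: a random start near the input, repeated signed-gradient steps, and a projection back into the allowed perturbation range after each step. The toy logistic-regression model and the names `input_gradient`, `project` and `pgd` are assumptions for illustration; the limit 0.2, step size 0.1 and 100 iterations mirror the settings reported in Section 4.2.

```python
import math
import random

def input_gradient(x, y, w, b):
    """Gradient of the logistic-regression cross-entropy loss w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def project(x_adv, x, eps):
    """Clamp x_adv to within eps of x and to the pixel range [0, 1]."""
    out = []
    for xa, xo in zip(x_adv, x):
        xa = min(xo + eps, max(xo - eps, xa))
        out.append(min(1.0, max(0.0, xa)))
    return out

def pgd(x, y, w, b, eps, alpha, steps, rng):
    # Random start inside the allowed perturbation range.
    x_adv = project([xi + rng.uniform(-eps, eps) for xi in x], x, eps)
    for _ in range(steps):
        grad = input_gradient(x_adv, y, w, b)
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        x_adv = project(stepped, x, eps)
    return x_adv

rng = random.Random(0)
x = [0.5, 0.5, 0.5, 0.5]        # hypothetical 4-pixel input, true label 1
w = [1.0, -2.0, 0.5, 0.0]       # hypothetical model weights
x_adv = pgd(x, y=1, w=w, b=0.0, eps=0.2, alpha=0.1, steps=100, rng=rng)
```

For this linear toy model the iterates quickly reach the boundary of the allowed range in every pixel that influences the loss, which is the behaviour that the clamping step is meant to enforce.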
3.3 Preprocessing network
The preprocessing network described in this section consists of 2 main parts: the denoising autoencoder and the loss sub-network. The autoencoder comprises an encoder and a decoder. The encoder takes an adversarial image as the input and down-scales it into a low-dimensional space. The decoder takes this down-scaled encoder output and produces a clean (denoised) image of the original dimension. Figure 4 shows the architecture of the autoencoder used in our experiments.
The loss sub-network is used to generate feature maps for the loss functions that govern the training of the autoencoder. It is a pretrained image classification network from which feature maps of images can be extracted. In this work, we use a VGG-16 network pretrained on ImageNet. The feature maps of the clean and adversarial samples of an input image are extracted, and a linear combination of their mean-squared errors is minimized by the autoencoder, along with the image reconstruction loss. For this study, we use 3 feature maps from the VGG-16 network: the first is taken 3 layers behind the softmax layer but before the max-pooling, the second 7 layers behind the softmax layer, and the third 10 layers behind the softmax layer. The three feature maps have different resolutions. Figure 5 shows the architecture of the VGG-16 loss network that we use and the exact layers from which the feature maps are extracted.
Reducing the mean-squared error of multiple feature maps forces the autoencoder to denoise the input image such that the output is not only visually similar to the input, but also generates similar feature maps on a deep network, minimizing the perturbations. The complete objective function for the autoencoder is given by:

$$\mathcal{L}(x, \hat{x}) = \lambda_1\,\mathrm{MSE}\left(\phi_3(x), \phi_3(\hat{x})\right) + \lambda_2\,\mathrm{MSE}\left(\phi_7(x), \phi_7(\hat{x})\right) + \lambda_3\,\mathrm{MSE}\left(\phi_{10}(x), \phi_{10}(\hat{x})\right) + \lambda_4\,\mathrm{MSE}\left(x, \hat{x}\right)$$

where $x$ is the input image, $\hat{x}$ is the corresponding denoised adversarial sample generated by the autoencoder, and $\phi_i$ is the function that produces the feature map of its input at the $i$-th layer from the softmax in the VGG-16 network. The final term in the objective is the mean-squared image reconstruction loss. The three mean-squared errors and the reconstruction error are linearly weighted in the objective function using the hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, as suggested by Johnson et al. (2016).
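The structure of this objective, a weighted sum of three feature-map errors and one reconstruction error, can be sketched in a few lines. In the example below the three entries of `feature_fns` are hypothetical stand-ins for the frozen VGG-16 layers, images are flat lists of pixels, and the assignment of the four weight values from Section 4.2 to individual terms is an assumption.

```python
def mse(a, b):
    """Mean-squared error between two equally sized flat lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def feature_loss_objective(x_clean, x_denoised, feature_fns, feature_weights, recon_weight):
    """Weighted sum of feature-map MSEs plus the image reconstruction MSE."""
    total = recon_weight * mse(x_clean, x_denoised)
    for phi, lam in zip(feature_fns, feature_weights):
        total += lam * mse(phi(x_clean), phi(x_denoised))
    return total

# Hypothetical stand-ins for three frozen loss-network layers.
feature_fns = [
    lambda img: [2.0 * p for p in img],
    lambda img: [p * p for p in img],
    lambda img: [sum(img)],
]

x_clean = [0.0, 0.5, 1.0]
x_denoised = [0.1, 0.5, 0.9]   # pretend autoencoder output for an adversarial input
loss = feature_loss_objective(
    x_clean, x_denoised, feature_fns,
    feature_weights=[0.00048, 0.00024, 0.00012],
    recon_weight=0.0013,
)
```

Only the autoencoder that produces `x_denoised` would receive gradient updates; the feature functions stay fixed, matching the statement that the loss sub-network parameters remain constant during training.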
We use 2 popular datasets in the field of adversarial machine learning research. MNIST is a database of grayscale images of the digits 0-9. It consists of 60000 training images and 10000 testing images of $28 \times 28$ resolution, roughly equally distributed among the 10 classes. CIFAR-10 is a dataset of RGB images from 10 classes of animals (e.g., frog, bird) and vehicles (e.g., aeroplane, truck). It consists of 50000 training images and 10000 testing images of $32 \times 32$ resolution. Images from this dataset are converted to grayscale and resized to match the input format of the neural networks used in this work.
4.2 Implementation details
We use two networks for this work. The loss network is a VGG-16 model pretrained on ImageNet for the image classification task. The architecture of the network is shown in Figure 4. Training details and hyperparameters can be found at github.com/pytorch/vision/blob/master/torchvision/models/vgg.py.
The autoencoder consists of an encoder and a decoder. The encoder takes an image as the input. It consists of a convolution activated by ReLU (Nair and Hinton, 2010) and max-pooling (kernel 2, stride 2). This is followed by another convolution that goes from 16 to 8 channels (stride 2, unit padding). This convolution is activated by ReLU and followed by max-pooling (kernel 2, stride 1). The output of the encoder is fed to the decoder. The decoder upsamples this encoder output via 3 successive transposed convolutions. The first 2 transposed convolutions are followed by the ReLU activation, whereas the last one is followed by the inverse tangent function. The three transposed convolutions have the following configurations, in order: 1) kernel size 3, stride 2, no padding, from 8 to 16 channels; 2) kernel size 5, stride 3, unit padding, from 16 to 8 channels; 3) kernel size 2, stride 2, unit padding, from 8 to 1 channel.
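The decoder configuration above fixes how the spatial size grows from the encoder output to the reconstructed image. The sketch below applies the standard output-size formula for a transposed convolution (no output padding, unit dilation) to the three listed layers. The candidate encoder output sizes are assumptions, since the encoder output size is not stated here; the helper names are likewise hypothetical.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a transposed convolution
    (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) of the three transposed convolutions, in order.
DECODER_LAYERS = [(3, 2, 0), (5, 3, 1), (2, 2, 1)]

def decoder_output_size(encoded_size):
    size = encoded_size
    for kernel, stride, padding in DECODER_LAYERS:
        size = conv_transpose_out(size, kernel, stride, padding)
    return size

# Try a few hypothetical encoder output sizes.
sizes = {s: decoder_output_size(s) for s in range(1, 5)}
```

Under these assumptions, a 2 x 2 encoder output is the one that the decoder maps to a 28 x 28 image, the native MNIST resolution (2 to 5 to 15 to 28).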
The autoencoder was trained for 100 epochs with a batch size of 128. A learning rate of 0.001 was used and the network was optimized using the Adam optimizer (Kingma and Ba, 2014). A weight decay of 0.00001 was used to regularize the network. The four weighting hyperparameters of the objective function were set to 0.00048, 0.00024, 0.00012 and 0.0013. These values are inversely related to the number of elements in the corresponding feature map: each weight belongs to one term in Equation 4 and is derived from the size of the feature map produced by the function in that term. This is done so that each term in the objective function is weighted according to the number of elements its feature map has.
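To make the inverse relation concrete, the short calculation below inverts the four reported weights to estimate the element counts they would correspond to if each weight were exactly the reciprocal of an element count. That exact-reciprocal reading, the dictionary keys, and the pairing of 0.0013 with the reconstruction term are assumptions made for illustration.

```python
# Reported weights; the key names are hypothetical labels for the four terms.
weights = {"w1": 0.00048, "w2": 0.00024, "w3": 0.00012, "w_recon": 0.0013}

# Element count implied by each weight under the exact-reciprocal assumption.
implied_elements = {name: round(1.0 / value) for name, value in weights.items()}

# A 28 x 28 single-channel image has 784 elements.
recon_weight_for_28x28 = round(1.0 / (28 * 28), 4)
```

Under this reading, the reconstruction weight is consistent with a 28 x 28 image (1/784 rounds to 0.0013), and each of the three feature-map weights is half the previous one, which would correspond to feature maps that double in element count.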
The adversarial samples generated using FGSM use a norm-bounded perturbation with a limit of 0.2. Samples generated using PGD use the same limit of 0.2. PGD was carried out for 100 iterations with an input step size of 0.1. Both attacks were carried out as untargeted attacks, i.e., a specific output class was not forced.
The model that we attack is a simple 4-layer convolutional neural network. There are 2 convolution:relu:max-pooling layers followed by 2 fully connected layers. The first convolution has unit stride and goes from 1 to 20 channels. The second convolution has unit stride and goes from 20 to 50 channels. Each max-pooling layer has size 2 and stride 2. The first fully connected layer takes in 800 inputs and down-scales to 500 outputs. The final fully connected layer takes these 500 inputs and down-samples to the 10 output neurons. This baseline architecture gives 99.5% accuracy on the MNIST dataset and 92.9% accuracy on the CIFAR-10 dataset.
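The 800 inputs of the first fully connected layer constrain the shapes of the preceding layers. The following sketch walks an image through two unpadded, unit-stride convolutions, each followed by 2 x 2 max-pooling, and reports the flattened feature count for several candidate kernel sizes. The 28 x 28 input size and the candidate kernel sizes are assumptions, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output side length of a max-pooling layer."""
    return (size - kernel) // stride + 1

def flattened_features(image_size, kernel, channels=50):
    size = image_size
    for _ in range(2):  # two convolution:relu:max-pooling stages
        size = pool_out(conv_out(size, kernel))
    return channels * size * size

# Flattened feature count for a 28 x 28 input and a few kernel sizes.
candidates = {k: flattened_features(28, k) for k in (3, 5, 7)}
```

Among these candidates, only 5 x 5 kernels give 50 x 4 x 4 = 800 features, so a 28 x 28 input with 5 x 5 convolutions is one configuration consistent with the layer sizes stated above.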
We perform a total of 4 experiments as outlined in the previous sections: PGD on MNIST, PGD on CIFAR-10, FGSM on MNIST, and FGSM on CIFAR-10. We achieve results comparable to Madry et al. (2017) and to the autoencoder approach of Liao et al. (2018). Both of the aforementioned defenses have been broken by slightly modifying the objective functions given in Equations 2 and 3. The complete results for the 4 experiments are given in Tables 1 and 2. All experiments were carried out using the methodology detailed in Section 4.2.
In this paper, we have developed an alternative approach and objective function to train denoising autoencoders as an adversarial defense. The novel objective function is motivated by the fact that although adversarial perturbations are imperceptible at the input level, they grow and propagate forward in a feed-forward or deep convolutional network and show up as noise in the feature maps generated by the adversarial samples on the given network. Using an additional pretrained loss network makes it non-trivial for attackers to break the defense using straightforward backpropagation. Several design choices made in this study are open to future work and may lead to improvements, such as the architecture of the autoencoder, the choice of the loss sub-network and its effect on accuracy, and the choice of feature maps within the loss sub-network. Evaluations are conducted in grey-box scenarios using FGSM and a powerful 100-iteration PGD. The results are promising on two popular datasets, MNIST and CIFAR-10, suggesting that this approach could lead to a viable long-term solution in the field of adversarial defense research.
I would like to thank Rui Wang and Praneeth Medepalli who reviewed this paper and gave valuable insights. (Nicolae et al., 2018)
- Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’19).
- Adversarial examples are not easily detected. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec ’17).
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- BERT: pre-training of deep bidirectional transformers for language understanding.
- A neural algorithm of artistic style. Journal of Vision 16 (12), pp. 326.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision.
- Adversarial logit pairing.
- Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- CIFAR-10 (Canadian Institute for Advanced Research).
- Adversarial examples in the physical world.
- Adversarial machine learning at scale. ArXiv abs/1611.01236.
- Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
- Detecting adversarial image examples in deep neural networks with adaptive noise reduction. IEEE Transactions on Dependable and Secure Computing, pp. 1–1.
- Defense against adversarial attacks using high-level representation guided denoiser. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Towards deep learning models resistant to adversarial attacks. ArXiv abs/1706.06083.
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, USA, pp. 807–814.
- Adversarial robustness toolbox v1.0.1. CoRR 1807.01069.
- You only look once: unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: available on arXiv:1505.04597 [cs.CV].
- Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
- Unsupervised domain alignment to mitigate low level dataset biases.
- Intriguing properties of neural networks. CoRR abs/1312.6199.
- Feature denoising for improving adversarial robustness. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Feature squeezing: detecting adversarial examples in deep neural networks. Proceedings 2018 Network and Distributed System Security Symposium.
- Theoretically principled trade-off between robustness and accuracy.