Adversarial attacks on neural networks pose a serious threat to safety-critical systems that rely on the high accuracy of these networks. The imperceptibility of additive evasion attacks makes it difficult to even detect their existence. Recent work has attempted to tackle this issue by designing defenses against such attacks, mostly under the assumption that the attacker has significant knowledge of the victim classifier and will therefore design an attack to optimally degrade the accuracy of that particular classifier. However, there is no guarantee that the attacker will choose to do so. Adversarial examples transfer across classifiers, and an adversary could exploit this property by crafting an attack based on a different classifier, either because they have only partial knowledge of the victim classifier or because they are deliberately trying to confuse the defender. Alternatively, an attacker with limited computational resources who is attacking multiple classifiers at once may tailor the attack to a single classifier and use it against all of them. There is thus a need to evaluate the sensitivity of a defense to the data used to train it, especially in the architecture mismatch case (i.e., when the architecture of the target classifier differs from the architecture of the adversary's classifier), since traditional strategies may not suffice.
In a defense-blind scenario, we investigate strategies to combat adversarial attacks under uncertainty about the architecture of the classifier whose gradients were used to generate the attacks. We also tackle the uncertainty caused by the variability of the attack algorithm by considering three established attacks. We empirically demonstrate the effectiveness of training a pre-processing Denoising Autoencoder (DAE) on an ensemble of noise simulations corresponding to all possible attacks.
Related Work: Many existing methods to combat adversarial attacks assume that such attacks are specifically crafted using the gradient of the victim model. Often, these methods are employed without modification to mitigate attacks generated using a disparate classifier's gradient, with less success than in the case where the attack is based on the victim classifier's gradient. For example, defenses based on Principal Components Analysis (PCA), autoencoder-based dimensionality reduction, and denoising autoencoders suffer a severe degradation of performance in architecture mismatch settings. Recent work has proposed training multiple DAEs (for filtering instead of dimensionality reduction) and randomly selecting one as a defense at test time. While this may be effective in confusing the attacker, these DAEs have only been tested against very mild attacks.
The idea of natural noise simulation was extensively investigated before the successful application of deep learning over the past decade. For example, a noise compensation strategy has been studied for speech recognition to accommodate common scenarios where training and test noises emanate from different environments. An early example of a pre-processing denoising recurrent network preceding a time-delay neural network was also introduced, and multiple noise simulations have been performed for robust speaker identification. Recently, using the new deep learning tools for simulating and filtering noise has also been investigated, particularly for speech processing (see e.g., [10, 11]). Simulating adversarial noise and directly using it to train the final classifier (a.k.a. adversarial training) has been investigated in multiple settings (see e.g., [13, 14, 15]), including settings where adversarial noise is used to model worst-case scenarios for natural noise (see e.g., [16, 17]).
To our knowledge, when adversarial noise is used to model an attacker's perturbation, there has been little work studying the transferability of defenses, particularly those relying on simulating adversarial noise. In discussing the transferability of a particular defense across multiple neural networks, prior work emphasizes that a DAE is superior to adversarial training because a DAE requires training only once but can mitigate attacks on multiple classifiers trained to perform the same task, whereas adversarial training would require retraining each of those classifiers. Not requiring classifier retraining can offer significant advantages in terms of computational cost and enabling applications. Given that the autoencoder defense is transferable, our work attempts to determine how best to train it when there is uncertainty about the gradient used to generate the attack and about the attack algorithm. Further, the ultimate goal is not to achieve robustness for a particular classifier, but for a particular task. In that light, one proposed approach trains multiple classifiers to perform the same task and chooses one or more of them at test time, since it is difficult for the adversary to generate attacks using gradients of multiple classifiers; a small Gaussian noise is also added to the trained weights of the classifiers. However, randomly selecting one of many trained networks to perform the classification task and/or using a majority vote neither prevents adversarial examples from transferring to each of those trained networks, nor addresses the uncertainty about the attack algorithm used by the adversary. Our work addresses both of these problems: uncertainty about the attack algorithm and about the network whose gradients are used to generate the attack.
We use a DAE as a defense, as explained in Algorithm 1. To generate data for DAE training, we consider two sources of variation: the attack algorithm and the classifier architecture whose gradients are used. For the attack algorithms, we consider three possibilities: Carlini-Wagner (CW), which is known for robustness; DeepFool (DF), which adds a very small perturbation; and Fast Gradient Sign (FGS), which enjoys computational efficiency. The CW attack minimizes the sum of the squared perturbation norm and a scaled cost function to find the optimum value of a variable $w$:

$$\min_w \; \|x' - x\|_2^2 + c \cdot f(x'),$$

where $f$ is a cost function, very similar to cross-entropy, that penalizes classifying the sample $x'$ as the true label $t$:

$$f(x') = \max\Big(Z_t(x') - \max_{i \neq t} Z_i(x'),\; -\kappa\Big).$$

Here, $x'$ is given by the sigmoid function, $x' = \sigma(w)$, and $Z_i$ is the input to the $i$-th neuron of the softmax layer. The constant $c$ is chosen to be the smallest value such that $f(x^*) \leq 0$, where $x^*$ is the solution. Gradient descent with multiple random starting points is used to find the solution.
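As an illustration, the pieces of this objective can be sketched in NumPy as follows (function names are ours; $\kappa$ is a confidence margin, 0 here; a real implementation minimizes the objective over $w$ by gradient descent and searches over $c$):

```python
import numpy as np

def cw_sample(w):
    """Change of variable: x' = sigmoid(w) keeps pixels in (0, 1),
    so the optimization over w is unconstrained."""
    return 1.0 / (1.0 + np.exp(-w))

def cw_cost(logits, t, kappa=0.0):
    """f(x') = max(Z_t(x') - max_{i != t} Z_i(x'), -kappa): positive
    while the sample is still classified as the true label t."""
    z_other = np.max(np.delete(logits, t))
    return max(logits[t] - z_other, -kappa)

def cw_objective(x, w, logits, t, c):
    """||x' - x||_2^2 + c * f(x'), minimized over w."""
    x_adv = cw_sample(w)
    return np.sum((x_adv - x) ** 2) + c * cw_cost(logits, t)
```

Once $f(x^*) \leq 0$, some class other than $t$ has a larger logit, i.e., the solution is misclassified.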
The DF attack finds a perturbed version of a sample $x$ by iteratively approximating the classifier as linear and finding the minimum $\ell_2$-norm perturbation that leads to misclassification. Let $x_i$ be the sample at the $i$-th iteration, $t$ be the true label, and $f_k(x_i)$ be the classifier output indicating the extent to which sample $x_i$ is classified to belong to class $k$. The perturbation added at iteration $i$ is:

$$r_i = \frac{|f'_l|}{\|w'_l\|_2^2}\, w'_l, \qquad l = \arg\min_{k \neq t} \frac{|f'_k|}{\|w'_k\|_2},$$

where $f'_k = f_k(x_i) - f_t(x_i)$ and $w'_k = \nabla f_k(x_i) - \nabla f_t(x_i)$. This continues until misclassification, or for a maximum number of iterations; in our experiments, this maximum is 50.
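One iteration of this update can be sketched in NumPy under the linear approximation, assuming the per-class outputs and their gradients at the current sample are given (names are ours):

```python
import numpy as np

def deepfool_step(logits, grads, t):
    """One DeepFool iteration under the local linear approximation.
    logits: (K,) classifier outputs at the current sample.
    grads:  (K, d) gradient of each class output w.r.t. the input.
    t:      true label.
    Returns the minimum l2-norm step toward the nearest linearized
    decision boundary."""
    K = logits.shape[0]
    best_ratio, best_k = np.inf, None
    for k in range(K):
        if k == t:
            continue
        f_k = logits[k] - logits[t]        # f'_k
        w_k = grads[k] - grads[t]          # w'_k
        ratio = abs(f_k) / np.linalg.norm(w_k)
        if ratio < best_ratio:             # nearest boundary l
            best_ratio, best_k = ratio, k
    f_l = logits[best_k] - logits[t]
    w_l = grads[best_k] - grads[t]
    return (abs(f_l) / np.linalg.norm(w_l) ** 2) * w_l
```

For a truly linear classifier a single step lands exactly on the decision boundary; for a deep network the step is repeated until the predicted class changes.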
The FGS attack adds a perturbation proportional to the sign of the gradient of the cost function $J$ with respect to the input:

$$x' = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x J(x, y)\big).$$

This perturbation is $\ell_\infty$-bounded by $\epsilon$.
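In NumPy, assuming the gradient has already been computed (e.g., from the adversary's classifier), the attack is a one-liner:

```python
import numpy as np

def fgs_attack(x, grad, eps):
    """x' = x + eps * sign(grad of J w.r.t. x); clipping keeps pixels
    in [0, 1]. Each pixel moves by at most eps, so the perturbation
    is l-infinity bounded by eps."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```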
For the classifier architectures on which the attacks are based, we consider two architecture types: Fully Connected (FC) and Convolutional Neural Network (CNN). Only a CNN is considered for the CIFAR-10 classification task, since that task is difficult to solve with an FC classifier.
II-A Experimental Setup
The MNIST-Digit and CIFAR-10 datasets were used for classification of 28x28 and 32x32 pixel images, respectively, into one of 10 classes. For fully connected networks, we use the notation FC-$n_1$-$n_2$-…-$n_L$ to denote that layer $i$ has $n_i$ neurons, for $i = 1, \dots, L$. The MNIST victim FC network has architecture FC-784-100-100-10, achieving an accuracy of 98.11%, and the MNIST adversary's FC network has architecture FC-784-200-100-100-10. Each layer has a Rectified Linear Unit (ReLU) activation, except the final layer, which has softmax. The MNIST adversary's CNN architecture is shown in Table I. The MNIST victim CNN achieves an accuracy of 98.66% and has a similar architecture to the adversary's, but with only two convolutional layers, followed by a softmax layer. The CIFAR-10 victim CNN, shown in Table II
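For concreteness, this notation maps to weight-matrix shapes as follows (a small illustrative helper, not part of the defense itself):

```python
def fc_layer_shapes(spec):
    """Parse the FC-n1-n2-...-nL notation into per-layer weight shapes.
    For example, the victim network FC-784-100-100-10 has weight
    matrices of shapes (784, 100), (100, 100), and (100, 10)."""
    sizes = [int(n) for n in spec.split("-")[1:]]
    return list(zip(sizes[:-1], sizes[1:]))
```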
, achieves an accuracy of 90.44%. Note that ELU and BatchNorm refer to Exponential Linear Unit activation and Batch Normalization, respectively. Also, the notation (Conv 3x3x32, ELU, BatchNorm)x2 refers to using two consecutive sequences of a convolutional layer followed by ELU activation and Batch Normalization. To obtain variations of this architecture to train the CIFAR-10 defense, an extra (Conv 3x3xc, ELU, BatchNorm) sequence is added after the second, fourth, or sixth such sequence, where c is the number of filters in the convolutional layer preceding the first added convolutional layer. These variations are used only to train the defense, and not to generate attacks to test the defense against. The adversary's architecture is obtained by adding such a sequence after the eighth such sequence. Data augmentation was performed prior to feeding the CIFAR-10 data into the CNN, through rotations of up to 15 degrees, width/height shifts of up to 10% of the original, and horizontal flips (details about the denoising autoencoders and training details for all networks can be found in the Appendix).
Table I: MNIST adversary's CNN architecture
|Conv 3x3x32, ReLU|
|Conv 3x3x64, ReLU|
|Max Pool 2x2|
|Dropout (rate = 0.25)|
|FC (128 neurons), ReLU|
|Dropout (rate = 0.5)|
|Softmax (10 classes)|
Table II: CIFAR-10 victim CNN architecture
|(Conv 3x3x32, ELU, BatchNorm)x2|
|Max Pool 2x2, Dropout (rate = 0.2)|
|(Conv 3x3x64, ELU, BatchNorm)x2|
|Max Pool 2x2, Dropout (rate = 0.3)|
|(Conv 3x3x128, ELU, BatchNorm)x2|
|Max Pool 2x2, Dropout (rate = 0.4)|
|(Conv 3x3x128, ELU, BatchNorm)x2|
|Max Pool 2x2, Dropout (rate = 0.4)|
|Softmax (10 classes)|
We evaluate the sensitivity of the defense to the choice of attacked data used for training by varying the architecture and the attack algorithm. To train the DAE with data perturbed according to the FGS attack, the perturbation norm was chosen such that the perturbation is simultaneously effective and reasonably imperceptible in the attacked images. For the MNIST-Digit dataset, this was achieved by using an $\ell_2$-norm of 1.5 when using gradients from a CNN, and an $\ell_2$-norm of 2.5 when using gradients from an FC network, as shown in Figure 1. For the CIFAR-10 dataset, an $\ell_2$-norm of 1.7 delivered reasonably imperceptible attacked images, as shown in Figure 2. We say that the defense is architecture-type-trained or attack-trained to refer to the architecture type or attack algorithm used to simulate the perturbations, respectively. Further, we say that the defense is ensemble-architecture-type-trained or ensemble-attack-trained when both FC and CNN architectures, or all three attack algorithms, are used to simulate the perturbations, respectively (code available at https://codeocean.com/capsule/7339381/tree/v1).
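The ensemble training sets can be assembled by pairing every attacked copy of the data with its clean original, as sketched below with hypothetical stand-in attack functions (in our experiments the perturbations come from CW, DF, and FGS run against FC and/or CNN gradients):

```python
import numpy as np

def build_dae_training_set(x_clean, attacks):
    """Ensemble noise simulation: each function in `attacks` maps
    clean inputs to perturbed inputs. The DAE input is the union of
    all attacked copies; the target is always the clean sample."""
    noisy = np.concatenate([atk(x_clean) for atk in attacks], axis=0)
    clean = np.tile(x_clean, (len(attacks),) + (1,) * (x_clean.ndim - 1))
    return noisy, clean  # train the DAE to map noisy -> clean (MSE)

# Hypothetical stand-ins for three simulated attacks:
attacks = [
    lambda x: np.clip(x + 0.1 * np.sign(np.ones_like(x)), 0, 1),  # FGS-like
    lambda x: x + 0.01,                                           # DF-like
    lambda x: x * 0.99,                                           # CW-like
]
```

The DAE is then trained with MSE loss to map `noisy` back to `clean`, so a single defense sees noise simulations from all attack/architecture combinations.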
II-B1 Varying the Architecture Type
Here, we vary the architecture type used to train the defense for a particular attack type. For each attack, we trained two defenses, each using gradients of one architecture type, and one defense using gradients of an ensemble of both architecture types. We performed this only with the MNIST-Digit dataset, since the architecture type cannot be varied with the CIFAR-10 dataset.
II-B2 Varying the Attack Algorithm
Next, we trained the defense with data attacked using gradients of a particular architecture type while varying the attack algorithm used. For the MNIST-Digit dataset, we used an FC architecture. With the CIFAR-10 dataset, we went a step further and trained the defense with data attacked using gradients from an ensemble of classifiers with varying CNN architectures.
III-1 Performance Metric
We first measure the accuracy improvement as the pre-defense accuracy subtracted from the post-defense accuracy. We then compute the percent increase in accuracy improvement when using the proposed defense over that obtained with traditional defenses:

$$\text{percent increase} = 100 \times \frac{I_p - I_t}{|I_t|},$$

where $I_p$ is the accuracy improvement when using our proposed defense, and $I_t$ is the average accuracy improvement when using all traditional defenses for which the attacker's choice of architecture type (or attack) was not involved. When we vary the architecture type, the traditional defense considered is the defense trained with the same attack as the attacker's, based on an architecture type different from the attacker's. In the absence of an attack while varying the architecture type, the traditional defenses considered are the two defenses trained with a particular attack based on a single architecture type. When we vary the attack algorithm, the traditional defenses considered are the two defenses trained with an attack algorithm different from the attacker's, and the one defense trained using the two attack algorithms that the attacker did not use. In the absence of an attack while varying the attack algorithm, the traditional defenses considered are the three defenses trained with one attack each, and the three defenses trained with an ensemble of two attacks.
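A small sketch of this metric, assuming it is the standard percent increase of the proposed improvement over the average traditional improvement:

```python
def percent_increase(i_p, traditional_improvements):
    """100 * (I_p - I_t) / |I_t|, where I_t is the average accuracy
    improvement over the applicable traditional defenses."""
    i_t = sum(traditional_improvements) / len(traditional_improvements)
    return 100.0 * (i_p - i_t) / abs(i_t)
```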
III-2 Varying the Architecture Type
In most cases, our ensemble-architecture-type-trained defense improves the accuracy, often significantly, as shown through the positive percent increases in Figure 3. The only case with a negative percent increase corresponds to a very weak attack, under which the classifier retains high accuracy even without a defense. We also note that in the case of such a weak attack, using the considered single-architecture-type-trained defenses also worsens the accuracy compared to the no-defense case. Hence, it may be wiser not to use a pre-processing defense when the attack is so weak.
With no attack, our ensemble-architecture-type-trained defense decreases the accuracy improvement by no more than 10.68%, and increases it by up to 37.7%.
III-3 Varying the Attack
Using our ensemble-attack-trained defense generally yields an increase in accuracy improvement compared to using the traditional defenses, as shown in Figure 4. The only instance where this increase is negative is when a DeepFool attack is used against the CIFAR-10 classification task. As before, this negative value occurs only when the attack is weak.
With no attack, our ensemble-attack-trained defense changes the accuracy improvement by +43.86% and -10.12% for the MNIST-Digit and CIFAR-10 datasets, respectively.
In our experiments, training the defense with an ensemble of architecture types made a larger impact than training it with an ensemble of attacks. We believe this is because, while generating attacks using different algorithms, we adjusted the hyperparameters so that the levels of perceptibility of the attacked images were similar. In conclusion, if the attacker is constrained to limit the perceptibility of the attack, then the choice of attack used in training the defense does not make a significant impact, as evident from Figure 4.
We also observe that it is crucial for the attacker to know the architecture type of the victim classifier, as illustrated by the pre-defense accuracies in Figure 3. From the defender's perspective, if the attack is weak, it is in their best interest to avoid using a pre-processing defense. This makes it important to develop detection strategies that determine how strong an attack is: at test time, if the attack is found to be weaker than a certain threshold, a pre-processing defense should be avoided to prevent a decrease in accuracy. Other problems we plan to consider in future work include the case when the test noise model cannot be included in training, creating adaptive defenses by classifying noise types and creating a noise dictionary, as well as incorporating uncertainty prediction in the defender's model.
-  N. Papernot, P. McDaniel, and I. Goodfellow. (2016, May). Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. [Online]. Available: https://arxiv.org/pdf/1605.07277.pdf.
-  F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh and P. McDaniel. (2017, April). The Space of Transferable Adversarial Examples. [Online]. Available: https://arxiv.org/pdf/1704.03453.pdf.
-  A. N. Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal, "Enhancing robustness of machine learning systems via data transformations," in Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, 2018.
-  R. Sahay, R. Mahfuz, and A. E. Gamal, "Combatting adversarial attacks through denoising and dimensionality reduction: a cascaded autoencoder approach," in Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2019.
-  R. Sahay, R. Mahfuz, and A. E. Gamal. (2019, June). A computationally efficient method for defending adversarial deep learning attacks. [Online]. Available: https://arxiv.org/pdf/1906.05599.pdf.
-  D. Meng and H. Chen. (2017, Sep.). MagNet: a Two-Pronged Defense against Adversarial Examples. [Online]. Available: https://arxiv.org/pdf/1705.09064.pdf.
-  D. Matrouf and J.-L. Gauvain, “Model compensation for noises in training and test data," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Munich, Germany, 1997.
-  S. Moon and J.-N. Hwang, “Coordinated training of noise removing networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Minneapolis, MN, USA, 1993.
-  L. Zao and R. Coelho, “Colored Noise Based Multicondition Training Technique for Robust Speaker Identification," IEEE Signal Processing Letters, vol. 18, no. 11, pp. 675-678, Nov. 2011.
-  N. Tawara et al., “Postfiltering Using an Adversarial Denoising Autoencoder with Noise-aware Training," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019.
-  J. Yuan and C. Bao, “Joint Ideal Ratio Mask and Generative Adversarial Networks for Monaural Speech Enhancement," in 14th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 2018.
-  U. Shaham, Y. Yamada, and S. Negahban. Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. [Online]. Available: https://arxiv.org/abs/1511.05432
-  B. Liu et al., “Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
-  S. E. Eskimez, K. Koishida and Z. Duan, "Adversarial Training for Speech Super-Resolution," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 347-358, May 2019.
-  A. Pandey and D. Wang, "On Adversarial Training and Loss Functions for Speech Enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
-  W.-N. Hsu et al., “Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019.
-  D. Mishra, S. Chaudhury, M. Sarkar and A. S. Soin, "Ultrasound Image Enhancement Using Structure Oriented Adversarial Network," IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1349-1353, Sep. 2018.
-  P. Vaishnavi, K. Eykholt, A. Prakash and A. Rahmati. (2019, September). Transferable Adversarial Robustness Using Adversarially Trained Autoencoders. [Online]. Available: https://arxiv.org/pdf/1909.05921.pdf.
-  K. Rajaratnam and J. Kalita, "Noise Flooding for Detecting Audio Adversarial Examples Against Automatic Speech Recognition," in IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Louisville, KY, USA, 2018.
-  Y. Zhou, M. Kantarcioglu and B. Xi. (2018, June). Breaking Transferability of Adversarial Samples with Randomness. [Online]. Available: https://arxiv.org/pdf/1805.04613.pdf.
-  N. Carlini and D. Wagner, “Towards Evaluating the Robustness of Neural Networks," in IEEE Symposium on Security and Privacy, San Jose, CA, USA, May 2017.
-  S. M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
-  I. J. Goodfellow, J. Shlens and C. Szegedy. (2014). Explaining and Harnessing Adversarial Examples. [Online]. Available: https://arxiv.org/abs/1412.6572.
-  S. Pascual, M. Park, J. Serrà, A. Bonafonte and K. Ahn, “Language and Noise Transfer in Speech Enhancement Generative Adversarial Network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
-  J. Zhou et al., “Training Multi-task Adversarial Network for Extracting Noise-robust Speaker Embedding," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019.
-  E. Yilmaz, H. V. Hamme and J. F. Gemmeke, “Adaptive noise dictionary design for noise robust exemplar matching of speech," in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015.
-  D. Dera, G. Rasool and N. Bouaynaya, “Extended Variational Inference for Propagating Uncertainty in Convolutional Neural Networks," in IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, 2019.
Details of Attack Generation
All attacks were generated using the TensorFlow Cleverhans library and were untargeted attacks. For the MNIST-Digit dataset, the CW attack was generated with 4 binary search steps, a maximum of 60 iterations, a learning rate of 0.1, a batch size of 10, an initial constant of 1.0, and the abort_early parameter was set to True. For the CIFAR-10 dataset, the CW attack was generated with 6 binary search steps, a maximum of 10000 iterations, a learning rate of 0.7, a batch size of 25, an initial constant of 0.001, and the abort_early parameter was set to True.
Architecture of Denoising Autoencoders
The MNIST DAE has architecture FC-784-256-128-81-128-256-784. There is no activation in any layer except the last, which has sigmoid activation. The architecture of the CIFAR-10 DAE is shown in Table 1.
Table 1: CIFAR-10 DAE architecture
|Conv 3x3x64, ReLU|
|Conv 3x3x32, ReLU|
|Max Pool 2x2|
|Conv 3x3x3, ReLU|
|Conv 3x3x32, ReLU|
|Conv 3x3x64, ReLU|
|Conv 3x3x64, Sigmoid|
Training Details of all Neural Networks
Table 2 shows the training details of all neural networks used in this letter. The default values were used for all hyperparameters not shown in the table.
|Parameters|MNIST FC|MNIST CNN|MNIST DAE|CIFAR CNN|CIFAR DAE|
|Loss|Categorical Crossentropy|Categorical Crossentropy|MSE|Categorical Crossentropy|MSE|