
Effects of Loss Functions And Target Representations on Adversarial Robustness

by   Sean Saito, et al.

Understanding and evaluating the robustness of neural networks against adversarial attacks is a subject of growing interest. Attacks proposed in the literature usually work with models that are trained to minimize cross-entropy loss and have softmax activations. In this work, we present experimental results that suggest the importance of considering other loss functions and target representations. Specifically, (1) training on mean-squared error and (2) representing targets as codewords generated from a random codebook show a marked increase in robustness against targeted and untargeted attacks under white-box and black-box settings. Our results show an increase in accuracy against untargeted attacks of up to 98.7 percentage points and a decrease in targeted attack success rates of up to 99.8 percentage points. For our experiments, we use the DenseNet architecture trained on three datasets (CIFAR-10, MNIST, and Fashion-MNIST).




1 Introduction

Neural networks produce state-of-the-art results across a large number of domains ([16], [36], [35], [12]). Despite increasing adoption of neural networks in commercial settings, recent work has shown that such algorithms are susceptible to inputs with imperceptible perturbations meant to cause misclassification ([33], [10]). It is thus important to investigate additional vulnerabilities as well as defenses against them.

Figure 1: Top. Adversarial images generated by the targeted white-box Madry et al. attack using the CIFAR-10 test dataset and a DenseNet model with one-hot target representations and softmax outputs trained to minimize cross-entropy. Bottom. Adversarial images generated under the identical setting, but using a DenseNet model trained to minimize the mean-squared error between tanh outputs and codeword targets. We choose the smallest $\epsilon$ which achieves a success rate of 100% against each model (0.03 and 0.35, respectively).

In this paper we investigate the problem of adversarial attacks on image classification systems. Attacks so far have only considered the conventional neural network architecture which outputs softmax predictions and is trained by minimizing the cross-entropy loss function. We thus propose and evaluate the robustness of neural networks against adversarial attacks with the following modifications:

  • Train the model to minimize mean-squared error (MSE), rather than cross-entropy.

  • Replace traditional one-hot target representations with codewords generated from a random codebook.

We evaluate our proposed modifications from multiple angles. First, we measure the robustness of the modified model using attacks under multiple threat scenarios. Secondly, we introduce an attack which, without sacrificing its efficacy towards conventional architectures, is tailored to our proposed modifications. Finally, we conduct spectral analysis on the model’s parameters to compute their upper Lipschitz bounds, a measure that has been shown to be correlated with a model’s robustness. Our results in Section 5 demonstrate that, across all three evaluations, our proposed model displays increased robustness compared to its conventional counterpart.

2 Background

2.1 Neural networks

A neural network is a non-linear function $f_\theta$ that maps data $x \in \mathbb{R}^n$ to targets $y \in \mathbb{R}^m$, where $n$ and $m$ are the dimensions of the input and target spaces, respectively, and $\theta$ represents the parameters of the neural network. For conventional neural networks and classification tasks, $y$ is typically a one-hot representation of the class label and $m$ is the number of classes in the dataset. In this work, we use the DenseNet architecture [13] as the existing benchmark, which has recently produced state-of-the-art results on several image datasets.

2.2 Adversarial examples

The goal of an adversarial attack is to cause some misclassification from the target neural network. In particular, [33] has shown that it is possible to construct some $x'$ by adding minimal perturbations to the original input $x$ such that the model misclassifies $x'$. Here, $x'$ is commonly referred to as an adversarial example, while the original data $x$ is referred to as a clean example. Apart from image classification, adversarial attacks have been proposed in both natural language and audio domains ([6], [2], [40]).

2.3 Attacks


We explore two adversarial settings, namely white-box and black-box scenarios. In the white-box setting, the attacker has access to and utilizes the model’s parameters, outputs, target representations, and loss function to generate adversarial examples. In the black-box scenario, the attacker has no access to the model’s parameters or specifications and only has the ability to query it for predictions. In this work, we employ transfer attacks, a type of black-box attack where adversarial examples are generated using a proxy model which the adversary has access to.


There are mainly two types of attacks. In a targeted attack, an adversary generates an adversarial example so that the target model returns some target class $t$. A targeted attack is evaluated by its success rate, which is the proportion of images for which the target class was successfully predicted (the lower the better from the perspective of the defense). In an untargeted attack, on the other hand, the attacker only needs the model to return some prediction other than the true class. It is evaluated by the accuracy of the target model, which denotes the proportion of adversarial images that are still classified correctly (the higher the better from the perspective of the defense).
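The two evaluation metrics described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function names and the toy label arrays are hypothetical.

```python
import numpy as np

def untargeted_accuracy(y_true, y_pred_adv):
    """Fraction of adversarial inputs the model still classifies correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred_adv)))

def targeted_success_rate(y_target, y_pred_adv):
    """Fraction of adversarial inputs classified as the attacker's target class."""
    return float(np.mean(np.asarray(y_target) == np.asarray(y_pred_adv)))

# Toy labels (hypothetical): four images, true classes, attacker targets,
# and the model's predictions on the adversarial versions.
y_true = [0, 1, 2, 3]
y_target = [1, 2, 3, 0]
y_pred_adv = [0, 2, 2, 0]

acc = untargeted_accuracy(y_true, y_pred_adv)
success = targeted_success_rate(y_target, y_pred_adv)
```

From the defender's perspective, a higher `acc` and a lower `success` both indicate a more robust model.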

The following sections describe the attacks used in this work.

Fast Gradient Sign Method (FGSM).

The Fast Gradient Sign Method [10], one of the earliest gradient-based attacks, generates adversarial examples via:

$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, y)\right)$$

where $J$ is the loss function of the neural network, $y$ is the target class, and $\epsilon$ is a parameter which controls the magnitude of the perturbations made to the original input $x$. The gradient, which is taken w.r.t. the input, determines the direction in which each pixel should be perturbed in order to maximize the loss function and cause a misclassification.
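As a minimal sketch of a single FGSM step, consider a toy linear softmax classifier, for which the input gradient of the cross-entropy loss has a closed form. The model, its random weights, and the helper names below are assumptions made for illustration; they are not the DenseNet or the Cleverhans implementation used in the experiments.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(W, b, x, y):
    """Cross-entropy loss of a linear softmax classifier for true class y."""
    return float(-np.log(softmax(W @ x + b)[y]))

def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                      # dL/dlogits = softmax - one_hot(y)
    return W.T @ p                   # chain rule through logits = W x + b

def fgsm(W, b, x, y, eps):
    """One signed-gradient step of size eps, clipped to the valid pixel range."""
    x_adv = x + eps * np.sign(input_gradient(W, b, x, y))
    return np.clip(x_adv, 0.0, 1.0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)   # hypothetical toy model
x, y = rng.uniform(0.2, 0.8, size=8), 0
x_adv = fgsm(W, b, x, y, eps=0.1)
```

Every pixel moves by at most `eps`, and because the loss of this toy model is convex in the input, the step is guaranteed to increase it.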

Basic Iterative Method (BIM).

The Basic Iterative Method, proposed by [18], applies FGSM iteratively to find more effective adversarial examples.

Momentum Iterative Method (MIM).

The Momentum Iterative Method [9] combines iterative gradient-based attacks with the accumulation of a velocity vector based on the gradient of the loss function.
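The momentum update can be sketched as below. The function signature, the L1 normalization of the gradient, and the decay factor of 1.0 are assumptions made for illustration; the toy quadratic loss stands in for a real network, and `grad_fn` is a hypothetical callable returning the loss gradient w.r.t. the input.

```python
import numpy as np

def mim(x, grad_fn, eps, eps_iter, nb_iter, decay=1.0):
    """Iterative signed-gradient ascent with an accumulated velocity vector."""
    x_adv, velocity = x.copy(), np.zeros_like(x)
    for _ in range(nb_iter):
        grad = grad_fn(x_adv)
        # Accumulate the L1-normalized gradient into the velocity.
        velocity = decay * velocity + grad / (np.sum(np.abs(grad)) + 1e-12)
        x_adv = x_adv + eps_iter * np.sign(velocity)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay inside the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv

# Toy loss J(x) = 0.5 * ||x - a||^2, whose gradient is x - a.
a = np.full(4, 0.5)
x = np.array([0.6, 0.4, 0.7, 0.3])
x_adv = mim(x, lambda v: v - a, eps=0.1, eps_iter=0.06, nb_iter=10)
```

With a decay of zero and no normalization, the loop reduces to the Basic Iterative Method.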

L-BFGS Attack.


[33] proposed the L-BFGS attack, the first targeted white-box attack on convolutional neural networks, which solves the following constrained optimization problem:

$$\min_{x'} \; c \, \|x - x'\|_2^2 + J(x', t) \quad \text{subject to } x' \in [0, 1]^n$$

Here $c$ is a constant that balances the two terms and $t$ is the target class. The above formulation aims to minimize two objectives: the left term measures the distance ($L_2$ norm) between the input and the adversarial example, while the right term represents the cross-entropy loss. It is used only as a targeted attack.

Deep Fool.

The Deep Fool attack, proposed by [24], treats the decision boundaries of a neural network as linear hyperplanes and uses an iterative optimization algorithm similar to the Newton-Raphson method to find the smallest perturbation which causes a misclassification. It is used only as an untargeted attack.
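The linearization idea can be illustrated on a classifier that really is affine, where a single step already reaches the nearest decision boundary. The sketch below assumes Euclidean distance and a toy two-class model; for a neural network the same step would be repeated on a local linear approximation until the prediction changes.

```python
import numpy as np

def deepfool_affine(W, b, x, overshoot=0.02):
    """Smallest L2 step that moves x across the nearest boundary of f(x) = Wx + b."""
    f = W @ x + b
    k0 = int(np.argmax(f))                   # currently predicted class
    best_r, best_dist = None, np.inf
    for k in range(W.shape[0]):
        if k == k0:
            continue
        w_k = W[k] - W[k0]                   # normal of the boundary between k0 and k
        f_k = f[k] - f[k0]                   # signed score gap (negative here)
        dist = abs(f_k) / np.linalg.norm(w_k)
        if dist < best_dist:
            best_dist = dist
            best_r = abs(f_k) / np.dot(w_k, w_k) * w_k
    # The small overshoot pushes the point just past the boundary.
    return x + (1.0 + overshoot) * best_r

W, b = np.eye(2), np.zeros(2)                # hypothetical toy classifier
x = np.array([2.0, 1.0])                     # classified as class 0
x_adv = deepfool_affine(W, b, x)
```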

Madry et al.

[22] proposed an attack based on projected gradient descent (PGD), which relies on local first order information of the target model. The method is similar to FGSM and BIM, except that it uses random starting positions for generating adversarial examples.
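A minimal sketch of projected gradient descent with a random start is given below, assuming an $L_\infty$ constraint of radius `eps`. The `grad_fn` callable and the toy linear loss are hypothetical stand-ins for the gradient of a real network's loss.

```python
import numpy as np

def pgd(x, grad_fn, eps, eps_iter, nb_iter, rng):
    """Signed-gradient ascent from a random point inside the eps-ball around x."""
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random starting position
    x_adv = np.clip(x_adv, 0.0, 1.0)
    for _ in range(nb_iter):
        x_adv = x_adv + eps_iter * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project back onto the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Toy linear loss J(x) = w . x, whose gradient is the constant vector w.
w = np.array([1.0, -2.0, 0.5])
x = np.full(3, 0.5)
x_adv = pgd(x, lambda v: w, eps=0.1, eps_iter=0.01, nb_iter=40,
            rng=np.random.default_rng(0))
```

Removing the random start and beginning at `x` itself recovers the Basic Iterative Method.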

Carlini & Wagner L2 (CWL2).

The Carlini & Wagner L2 attack [5] follows an optimization problem similar to that of L-BFGS but replaces cross-entropy with a cost function that depends on the pre-softmax logits of the network. In particular, the attack solves the following problem:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot f(x + \delta) \quad \text{subject to } x + \delta \in [0, 1]^n$$

where $\delta$ is the perturbation made to the input and $f$ is the objective function:

$$f(x') = \max\left(\max_{i \neq t} Z(x')_i - Z(x')_t, \; 0\right)$$

Here, $Z$ represents the pre-softmax logits of the network. In short, the attack aims to maximize the logit value of the target class $t$ while minimizing the $L_2$ norm of the input perturbations.
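The cost function and the combined objective can be sketched in a few lines. This assumes the hinge form with zero confidence margin; `logits_fn` is a hypothetical callable mapping an input to its pre-softmax logits, replaced here by the identity for illustration.

```python
import numpy as np

def cw_cost(logits, target):
    """Zero when the target logit is the largest, positive otherwise."""
    other = np.delete(logits, target)
    return max(float(other.max() - logits[target]), 0.0)

def cw_objective(x, delta, logits_fn, target, c):
    """Squared L2 size of the perturbation plus c times the attack cost."""
    return float(np.sum(delta ** 2)) + c * cw_cost(logits_fn(x + delta), target)

logits = np.array([1.0, 3.0, 2.0])
cost_if_target_1 = cw_cost(logits, 1)   # class 1 already has the largest logit
cost_if_target_2 = cw_cost(logits, 2)   # class 2 trails class 1 by 1.0

# Toy "network" whose logits equal its input (an assumption for illustration).
obj = cw_objective(logits, np.array([0.0, 0.0, 0.5]), lambda v: v, target=2, c=2.0)
```

A larger `c` weights the cost term more heavily, so the optimizer accepts larger distortions in exchange for reaching the target class.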

3 Improving adversarial robustness

In this work we have two proposals. First, we propose changes to the conventional neural network architecture and target representations to defend against adversarial attacks described in Section 2.3. Second, we propose a modified, more effective CWL2 attack that is specifically tailored to our proposed defense.

3.1 Training on mean-squared error

Instead of the conventional cross-entropy loss, we propose to use MSE to compute the error between the output of the model $f_\theta(x)$ and the target $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of target representations for all classes. During inference, we select the output class whose target representation yields the smallest Euclidean distance to $f_\theta(x)$.
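The loss and the nearest-target decoding rule can be sketched as follows. The three-class codebook and the model output are toy values chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def mse(output, target):
    """Mean-squared error between a model output and a target representation."""
    return float(np.mean((output - target) ** 2))

def predict_class(output, codebook):
    """Index of the target representation closest to the output (Euclidean)."""
    dists = np.linalg.norm(codebook - output, axis=1)
    return int(np.argmin(dists))

# One row per class; rows are the target representations.
codebook = np.array([[ 1.0, -1.0,  1.0],
                     [-1.0,  1.0,  1.0],
                     [-1.0, -1.0, -1.0]])
output = np.array([0.8, -0.9, 0.7])

pred = predict_class(output, codebook)
loss = mse(output, codebook[pred])
```

With one-hot targets the same decoding rule applies, with the rows of the identity matrix serving as the codebook.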

3.2 Randomized target representations

Instead of using one-hot encoding as target representations, we represent each target class as a codeword from a random codebook. Specifically, the target representations corresponding to the classes are sampled once at the beginning of training from a uniform distribution based on a secret key. To match the representation space of the network output and the targets, the conventional softmax layer is replaced with a tanh activation whose output length equals the codeword length.
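A random codebook of this kind could be generated as in the sketch below. Two details are assumptions rather than facts from the text: the sampling range of $(-1, 1)$, chosen to match the range of tanh, and the use of the secret key as the seed of a pseudo-random generator.

```python
import numpy as np

def make_codebook(num_classes, codeword_length, secret_key):
    """Sample one codeword per class, once, from a keyed uniform distribution."""
    rng = np.random.default_rng(secret_key)   # the key acts as the seed (assumption)
    return rng.uniform(-1.0, 1.0, size=(num_classes, codeword_length))

codebook = make_codebook(num_classes=10, codeword_length=64, secret_key=1234)
same_key = make_codebook(10, 64, secret_key=1234)
other_key = make_codebook(10, 64, secret_key=4321)

# A tanh output layer of the same length lies in the same range as the codewords.
pre_activation = np.linspace(-3.0, 3.0, 64)   # hypothetical pre-activation values
output = np.tanh(pre_activation)
```

The same key always reproduces the same codebook, while a different key yields unrelated codewords.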


3.3 Modified CWL2 attack

The Carlini & Wagner L2 attack makes several assumptions about the target network's architecture based on its cost function mentioned in Section 2.3, namely that the highest logit value corresponds to the most likely class. However, applying our proposed neural network modifications breaks such assumptions, for the output of the network would be tanh activations and the length of the output would not correspond to the number of classes in the dataset. We thus propose a simple modification to the CWL2 attack where the cost function considers the distance $D$ in some metric space between the logits and the targets:

$$f(x') = \max\left(D\left(Z(x'), y_t\right) - \min_{i \neq t} D\left(Z(x'), y_i\right), \; 0\right)$$

where $y_i$ is the target representation of class $i$ and $t$ is the target class. Like with the Carlini & Wagner L2 attack, $f(x') \le 0$ if and only if the model predicts the target class. Using the change-of-variables formulation utilized in [5] to enforce box constraints on the perturbations, our attack finds some $w$ which optimizes the following objective:

$$\min_{w} \; \left\| \tfrac{1}{2}\left(\tanh(w) + 1\right) - x \right\|_2^2 + c \cdot f\left(\tfrac{1}{2}\left(\tanh(w) + 1\right)\right)$$

where $c$ is a trade-off constant that controls the importance of the size of perturbations (larger values of $c$ allow for larger distortions). For our experiments, we have defined $D$ as the Euclidean distance.
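One way to realize such a distance-based cost is sketched below. The hinge form, the function names, and the toy two-class codebook are assumptions made for illustration and may differ from the authors' exact formulation; `output_fn` is a hypothetical callable returning the network output for an image.

```python
import numpy as np

def codeword_cost(output, codebook, target):
    """Zero when the target codeword is the nearest one, positive otherwise."""
    d = np.linalg.norm(codebook - output, axis=1)     # Euclidean distances
    return max(float(d[target] - np.delete(d, target).min()), 0.0)

def to_image(w):
    """Change of variables: maps any real w into the box [0, 1]."""
    return 0.5 * (np.tanh(w) + 1.0)

def objective(w, x, output_fn, codebook, target, c):
    """Squared L2 distortion plus c times the distance-based cost."""
    x_adv = to_image(w)
    distortion = float(np.sum((x_adv - x) ** 2))
    return distortion + c * codeword_cost(output_fn(x_adv), codebook, target)

codebook = np.array([[1.0, 1.0], [-1.0, -1.0]])
output = np.array([0.9, 0.8])                         # closest to codeword 0

cost_t0 = codeword_cost(output, codebook, target=0)
cost_t1 = codeword_cost(output, codebook, target=1)
obj = objective(np.zeros(2), np.full(2, 0.5), lambda v: output, codebook, 0, c=1.0)
```

Because the cost only compares distances to codewords, it does not assume that the output length equals the number of classes.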

3.4 Lipschitz bounds and robustness

Earlier works have suggested that the sensitivity of neural networks towards adversarial perturbations can be measured with the upper Lipschitz bound of each network layer [33]. Parseval Networks [7], for example, have introduced a layer-wise regularization technique for improving robustness by enforcing smaller global Lipschitz bounds. More specifically, [7] have shown that:

$$L_{adv}(W, p, \delta) \le L(W) + \lambda_p \Lambda_p \delta$$

where $L_{adv}$ is the adversarial risk under perturbations of $\ell_p$ norm at most $\delta$, $L(W)$ is the generalization error, and $\lambda_p$ and $\Lambda_p$ are the upper Lipschitz bounds of the loss function and the network, respectively. In other words, the efficacy of an adversarial attack depends on the generalization error of the target model as well as the Lipschitz bounds of its layers. This suggests that smaller Lipschitz bounds indicate a more robust model. For both fully-connected and convolutional layers, this can be measured by calculating their operator norms. The operator norm of the $k$-th fully-connected layer is simply the largest singular value of its weight matrix $W_k$. The Lipschitz constant of the $k$-th layer is then:

$$L_k = \|W_k\|_2 = \sigma_{\max}(W_k)$$

For convolutional kernels, we rely on the formulation in [33], which involves applying the two-dimensional discrete Fourier Transform to find the largest singular values.
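Both operator norms can be computed with NumPy as sketched below. The convolutional case assumes circular padding and a kernel laid out as (height, width, input channels, output channels), under which the layer is diagonalized per frequency by the 2D discrete Fourier transform; this is a sketch under those assumptions and not necessarily the exact procedure used for Figure 4.

```python
import numpy as np

def dense_operator_norm(W):
    """Largest singular value of a fully-connected weight matrix."""
    return float(np.linalg.svd(W, compute_uv=False)[0])

def conv_operator_norm(kernel, input_size):
    """Largest singular value of a circular 2D convolution.

    kernel has shape (k, k, c_in, c_out); input_size is the spatial size n
    of the n x n feature map the kernel is applied to.
    """
    # Zero-pad the kernel to n x n and take the 2D DFT over the spatial axes.
    transforms = np.fft.fft2(kernel, s=(input_size, input_size), axes=(0, 1))
    # Each frequency holds a c_in x c_out matrix; take the SVD of every one.
    return float(np.linalg.svd(transforms, compute_uv=False).max())

dense_norm = dense_operator_norm(np.diag([3.0, 1.0]))

identity_kernel = 2.0 * np.eye(2).reshape(1, 1, 2, 2)   # 1x1 conv scaling by 2
scaled_norm = conv_operator_norm(identity_kernel, input_size=4)

averaging_kernel = np.full((3, 3, 1, 1), 1.0 / 9.0)     # 3x3 box filter
box_norm = conv_operator_norm(averaging_kernel, input_size=8)
```

The box filter attains its norm of one at the zero frequency, which matches the intuition that averaging cannot amplify a constant image.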

Section 5.6 presents empirical results which demonstrate that simply changing the loss function from cross-entropy to mean-squared error can yield model parameters with significantly smaller Lipschitz bounds.

4 Experimental setup

In this section we describe the evaluation datasets, evaluation models and adversarial image generation process.

4.1 Datasets

CIFAR-10 [16] is a small image classification dataset with 10 classes. It contains 60,000 thumbnail-size images of dimensions 32x32x3, of which 10,000 images are withheld for testing.

MNIST [21] is another image classification dataset containing monochromatic thumbnails (28x28) of handwritten digits. It comprises 60,000 training images and 10,000 testing images.

Fashion-MNIST [38] is a relatively new image classification dataset containing thumbnail images of 10 different types of clothing (shoes, shirts, etc.) which acts as a drop-in replacement for MNIST.

4.2 Models evaluated

We use three variants of the DenseNet model to generate adversarial examples:

  • O:SOFTMAX:CE refers to a DenseNet model with softmax activations trained on cross-entropy loss and one-hot target representations.

  • O:SOFTMAX:MSE refers to a DenseNet model with softmax activations trained on MSE and one-hot target representations.

  • R:TANH:MSE refers to a DenseNet model with tanh activations trained on MSE using codeword target representations. We used a codeword length of .

We have evaluated the robustness of the R:TANH:MSE model with different codeword lengths (64, 256, and 1024) but found no significant discrepancies in the results.

| Attack | Modified parameter |
|---|---|
| Basic Iterative Method | epsilon ($\epsilon$) |
| Carlini & Wagner L2 | initial constant ($c$) |
| Deep Fool | max iterations |
| Fast Gradient Sign Method | epsilon ($\epsilon$) |
| L-BFGS Attack | initial constant ($c$) |
| Madry et al. | epsilon ($\epsilon$) |
| Momentum Iterative Method | epsilon ($\epsilon$) |

Table 1: Parameters changed for each attack

| Attack | Parameters held constant |
|---|---|
| Basic Iterative Method | eps_iter = 0.05, nb_iter = 10 |
| Carlini & Wagner L2 | binary_search_steps = 5, max_iterations = 1000 |
| Deep Fool | nb_candidate = 10, overshoot = 0.02 |
| L-BFGS | binary_search_steps = 5, max_iterations = 1000 |
| Madry et al. | eps_iter = 0.01, nb_iter = 40 |
| Momentum Iterative Method | eps_iter = 0.06, nb_iter = 10 |

Table 2: Parameters held constant for each attack

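| Dataset | Model | CWL2, c = 0.01 | CWL2, c = 0.1 | CWL2, c = 1 | CWL2, c = 10 | MIM, ε = 0.01 | MIM, ε = 0.05 | MIM, ε = 0.1 | MIM, ε = 0.2 | Deep Fool, 10 iterations | Deep Fool, 20 iterations | Deep Fool, 30 iterations | Deep Fool, 40 iterations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:CE | 0.022 | 0.022 | 0.022 | 0.022 | 0.682 | 0.043 | 0.041 | 0.041 | 0.159 | 0.049 | 0.034 | 0.031 |
| CIFAR-10 | O:SOFTMAX:MSE | 0.078 | 0.044 | 0.039 | 0.039 | 0.838 | 0.595 | 0.509 | 0.467 | 0.112 | 0.069 | 0.065 | 0.061 |
| CIFAR-10 | R:TANH:MSE | 0.583 | 0.584 | 0.586 | 0.585 | 0.919 | 0.701 | 0.593 | 0.536 | 0.583 | 0.582 | 0.582 | 0.582 |
| MNIST | O:SOFTMAX:CE | 0.008 | 0.008 | 0.008 | 0.008 | 0.994 | 0.661 | 0.012 | 0.007 | 0.009 | 0.008 | 0.008 | 0.008 |
| MNIST | O:SOFTMAX:MSE | 0.897 | 0.182 | 0.123 | 0.118 | 0.997 | 0.986 | 0.956 | 0.831 | 0.100 | 0.074 | 0.059 | 0.049 |
| MNIST | R:TANH:MSE | 0.995 | 0.995 | 0.975 | 0.983 | 0.995 | 0.995 | 0.994 | 0.973 | 0.815 | 0.815 | 0.815 | 0.815 |
| Fashion-MNIST | O:SOFTMAX:CE | 0.041 | 0.041 | 0.041 | 0.041 | 0.196 | 0.038 | 0.035 | 0.034 | 0.049 | 0.042 | 0.041 | 0.041 |
| Fashion-MNIST | O:SOFTMAX:MSE | 0.156 | 0.076 | 0.057 | 0.049 | 0.836 | 0.304 | 0.211 | 0.142 | 0.082 | 0.064 | 0.059 | 0.056 |
| Fashion-MNIST | R:TANH:MSE | 0.946 | 0.942 | 0.946 | 0.945 | 0.902 | 0.691 | 0.574 | 0.568 | 0.935 | 0.935 | 0.935 | 0.935 |

| Dataset | Model | BIM, ε = 0.01 | BIM, ε = 0.05 | BIM, ε = 0.1 | BIM, ε = 0.2 | FGSM, ε = 0.01 | FGSM, ε = 0.05 | FGSM, ε = 0.1 | FGSM, ε = 0.2 | Madry et al., ε = 0.02 | Madry et al., ε = 0.04 | Madry et al., ε = 0.08 | Madry et al., ε = 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:CE | 0.751 | 0.053 | 0.042 | 0.042 | 0.743 | 0.291 | 0.193 | 0.139 | 0.301 | 0.050 | 0.041 | 0.041 |
| CIFAR-10 | O:SOFTMAX:MSE | 0.807 | 0.424 | 0.240 | 0.174 | 0.879 | 0.729 | 0.666 | 0.535 | 0.790 | 0.707 | 0.668 | 0.608 |
| CIFAR-10 | R:TANH:MSE | 0.850 | 0.634 | 0.390 | 0.213 | 0.923 | 0.699 | 0.604 | 0.451 | 0.923 | 0.897 | 0.877 | 0.839 |
| MNIST | O:SOFTMAX:CE | 0.994 | 0.628 | 0.015 | 0.008 | 0.994 | 0.949 | 0.654 | 0.227 | 0.983 | 0.809 | 0.263 | 0.008 |
| MNIST | O:SOFTMAX:MSE | 0.997 | 0.929 | 0.490 | 0.196 | 0.997 | 0.988 | 0.985 | 0.774 | 0.992 | 0.983 | 0.975 | 0.896 |
| MNIST | R:TANH:MSE | 0.995 | 0.882 | 0.429 | 0.196 | 0.995 | 0.995 | 0.918 | 0.332 | 0.995 | 0.995 | 0.995 | 0.993 |
| Fashion-MNIST | O:SOFTMAX:CE | 0.564 | 0.038 | 0.037 | 0.036 | 0.659 | 0.321 | 0.225 | 0.147 | 0.046 | 0.036 | 0.033 | 0.029 |
| Fashion-MNIST | O:SOFTMAX:MSE | 0.815 | 0.296 | 0.176 | 0.142 | 0.882 | 0.509 | 0.362 | 0.224 | 0.731 | 0.542 | 0.425 | 0.315 |
| Fashion-MNIST | R:TANH:MSE | 0.799 | 0.233 | 0.089 | 0.051 | 0.905 | 0.671 | 0.389 | 0.185 | 0.901 | 0.863 | 0.829 | 0.802 |

Table 3: Robustness of each model against each untargeted white-box attack. The table reports the accuracy (the higher the better) of each model against each attack.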

4.3 Generating adversarial examples

For each dataset mentioned in Section 4.1, we train a model on the training set and generate adversarial examples using the test set. For targeted attacks, we randomly sample a target class for each image in the test set.

We evaluate each model’s (listed in Section 4.2) robustness against attacks (listed in Table 1) under the white-box setting. For the R:TANH:MSE model, the attacker has access to the codeword representations. We also evaluate model robustness against transfer attacks, a type of black-box attack where adversarial examples are generated using a proxy model which the adversary has access to. Finally, we further measure the robustness of our proposed model using the modified CWL2 attack.

All experiments are implemented using TensorFlow, a popular framework for building deep learning algorithms.

4.3.1 Attack parameters

For a given attack, we generate adversarial examples across a range of values for a particular parameter which controls the magnitude of the perturbations made. Table 1 lists the parameters which are modified for each attack, whereas Table 2 lists the parameters held constant. We use the default values defined in Cleverhans for our constant parameters.

4.3.2 Adapting attacks to our proposed techniques

The attacks described in Section 2.3 are implemented using the Cleverhans library [25]. By default, the attacks assume that the model outputs softmax predictions and that the targets are represented as one-hot vectors. Hence the internal loss function for some attacks (e.g. gradient-based iterative attacks) is predefined as cross-entropy. However, because the cross-entropy loss function is not compatible with the R:TANH:MSE model, we have adapted the library to use mean-squared error when the target model has also been trained on mean-squared error. These adaptations are important in preserving the white-box assumption of each attack.
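To illustrate what such an adaptation involves, the sketch below computes the input gradient of a mean-squared error loss for a toy model with tanh outputs; a gradient-based attack would use this gradient in place of the cross-entropy gradient. The linear-plus-tanh model and all names are assumptions made for illustration and do not reproduce the modified Cleverhans code.

```python
import numpy as np

def mse_loss(W, b, x, codeword):
    """MSE between the tanh output of a linear layer and a codeword target."""
    out = np.tanh(W @ x + b)
    return float(np.mean((out - codeword) ** 2))

def mse_input_gradient(W, b, x, codeword):
    """Gradient of mse_loss w.r.t. the input x, derived by the chain rule."""
    out = np.tanh(W @ x + b)
    d_out = 2.0 * (out - codeword) / out.size    # derivative of the mean square
    return W.T @ (d_out * (1.0 - out ** 2))      # tanh'(z) = 1 - tanh(z)^2

rng = np.random.default_rng(0)
W, b = rng.normal(size=(6, 4)), rng.normal(size=6)   # hypothetical toy model
x = rng.uniform(0.0, 1.0, size=4)
codeword = rng.uniform(-1.0, 1.0, size=6)

grad = mse_input_gradient(W, b, x, codeword)
x_adv = np.clip(x + 0.05 * np.sign(grad), 0.0, 1.0)  # one FGSM-style step
```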

5 Experimental observations

In this section, we present and analyze the performance of the evaluation models under different attack scenarios: untargeted and targeted attacks (Section 5.2), black-box attacks (Section 5.3), and our modified CWL2 attack (Section 5.4). Benchmark performances on the original datasets are presented in Section 5.1.

5.1 Clean test performance

Table 4 lists the accuracy of each model across each clean test dataset. We observe minimal differences in accuracies across the models, and hence our proposed modifications can maintain state-of-the-art classification performances.

| Model | CIFAR-10 | MNIST | Fashion-MNIST |
|---|---|---|---|
| O:SOFTMAX:CE | 0.933 | 0.996 | 0.948 |
| O:SOFTMAX:MSE | 0.931 | 0.997 | 0.948 |
| R:TANH:MSE | 0.930 | 0.996 | 0.945 |

Table 4: Accuracy on each clean test dataset.

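| Dataset | Model | L-BFGS, c = 0.01 | L-BFGS, c = 0.1 | L-BFGS, c = 1 | L-BFGS, c = 10 | BIM, ε = 0.1 | BIM, ε = 0.2 | BIM, ε = 0.3 | BIM, ε = 0.4 | Madry et al., ε = 0.04 | Madry et al., ε = 0.06 | Madry et al., ε = 0.08 | Madry et al., ε = 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 0.997 | 1.00 | 1.00 | 1.00 | 0.934 | 0.998 | 1.00 | 1.00 |
| CIFAR-10 | O:SOFTMAX:MSE | 0.667 | 0.864 | 0.955 | 0.994 | 0.461 | 0.624 | 0.658 | 0.664 | 0.266 | 0.343 | 0.402 | 0.441 |
| CIFAR-10 | R:TANH:MSE | 0.272 | 0.475 | 0.554 | 0.564 | 0.230 | 0.337 | 0.353 | 0.353 | 0.242 | 0.345 | 0.426 | 0.467 |
| MNIST | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 0.851 | 0.992 | 0.997 | 0.999 | 0.057 | 0.464 | 0.828 | 0.94 |
| MNIST | O:SOFTMAX:MSE | 0.040 | 0.536 | 0.92 | 0.991 | 0.316 | 0.539 | 0.597 | 0.612 | 0.008 | 0.042 | 0.163 | 0.269 |
| MNIST | R:TANH:MSE | 0.045 | 0.457 | 0.72 | 0.776 | 0.057 | 0.129 | 0.169 | 0.184 | 0.007 | 0.068 | 0.154 | 0.245 |
| Fashion-MNIST | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 0.957 | 1.00 | 1.00 | 1.00 | 0.999 | 1.00 | 1.00 | 1.00 |
| Fashion-MNIST | O:SOFTMAX:MSE | 0.571 | 0.87 | 0.97 | 0.992 | 0.457 | 0.581 | 0.600 | 0.603 | 0.464 | 0.589 | 0.648 | 0.659 |
| Fashion-MNIST | R:TANH:MSE | 0.644 | 0.808 | 0.826 | 0.832 | 0.807 | 0.926 | 0.938 | 0.940 | 0.626 | 0.724 | 0.794 | 0.834 |

| Dataset | Model | CWL2, c = 0.01 | CWL2, c = 0.1 | CWL2, c = 1 | CWL2, c = 10 | MIM, ε = 0.1 | MIM, ε = 0.2 | MIM, ε = 0.3 | MIM, ε = 0.4 | FGSM, ε = 0.1 | FGSM, ε = 0.2 | FGSM, ε = 0.3 | FGSM, ε = 0.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.445 | 0.316 | 0.231 | 0.182 |
| CIFAR-10 | O:SOFTMAX:MSE | 0.756 | 0.842 | 0.861 | 0.867 | 0.351 | 0.433 | 0.459 | 0.468 | 0.046 | 0.061 | 0.071 | 0.083 |
| CIFAR-10 | R:TANH:MSE | 0.368 | 0.362 | 0.361 | 0.346 | 0.095 | 0.136 | 0.137 | 0.160 | 0.028 | 0.044 | 0.081 | 0.082 |
| MNIST | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 0.908 | 0.997 | 0.998 | 0.998 | 0.095 | 0.131 | 0.115 | 0.113 |
| MNIST | O:SOFTMAX:MSE | 0.176 | 0.592 | 0.677 | 0.669 | 0.153 | 0.319 | 0.347 | 0.357 | 0.014 | 0.028 | 0.052 | 0.067 |
| MNIST | R:TANH:MSE | 0.006 | 0.006 | 0.006 | 0.002 | 0.023 | 0.040 | 0.052 | 0.050 | 0.007 | 0.031 | 0.047 | 0.067 |
| Fashion-MNIST | O:SOFTMAX:CE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.213 | 0.158 | 0.122 | 0.109 |
| Fashion-MNIST | O:SOFTMAX:MSE | 0.658 | 0.812 | 0.845 | 0.851 | 0.394 | 0.396 | 0.394 | 0.382 | 0.073 | 0.085 | 0.102 | 0.107 |
| Fashion-MNIST | R:TANH:MSE | 0.583 | 0.592 | 0.576 | 0.548 | 0.114 | 0.140 | 0.143 | 0.147 | 0.048 | 0.081 | 0.088 | 0.097 |

Table 5: Robustness of each model against each targeted white-box attack. The table reports the success rates (the lower the better) of each attack.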

5.2 Untargeted and targeted attacks

Table 3 lists the accuracies of the models against untargeted white-box attacks. Both O:SOFTMAX:MSE and R:TANH:MSE models demonstrate higher accuracies on the adversarial examples compared to the O:SOFTMAX:CE model; we observe an increase in accuracy of up to 98.7 percentage points. Similar results can be observed in Table 5, where the O:SOFTMAX:MSE and R:TANH:MSE models achieve a consistent decrease in attack success rates of up to 99.8 percentage points.
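The two headline figures are absolute differences between table entries. As a worked check, the sketch below uses one pair of CWL2 entries from Tables 3 and 5 that reproduces them; that these particular entries are the ones behind the reported maxima is an assumption.

```python
# Untargeted CWL2 (Table 3): accuracy of O:SOFTMAX:CE vs. R:TANH:MSE
# on the same dataset at initial constant 0.01.
acc_ce, acc_tanh_mse = 0.008, 0.995

# Targeted CWL2 (Table 5): success rate against O:SOFTMAX:CE vs. R:TANH:MSE
# on the same dataset at initial constant 10.
succ_ce, succ_tanh_mse = 1.00, 0.002

# Differences expressed in percentage points.
accuracy_gain = round((acc_tanh_mse - acc_ce) * 100, 1)
success_drop = round((succ_ce - succ_tanh_mse) * 100, 1)
```

Both quantities are differences of rates rather than relative changes, which is why they are best read as percentage points.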

5.3 Black box attacks

Table 6 shows the accuracies of the O:SOFTMAX:MSE and R:TANH:MSE models against transfer attacks. Our proposed models are more robust to black-box attacks than to white-box attacks with the same configurations. Though this is expected behavior, it is imperative to evaluate a defense under multiple threat scenarios.

| Dataset | Model | CWL2 | Deep Fool | MIM |
|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:MSE | 0.483 | 0.451 | 0.895 |
| CIFAR-10 | R:TANH:MSE | 0.612 | 0.617 | 0.926 |
| MNIST | O:SOFTMAX:MSE | 0.996 | 0.984 | 0.997 |
| MNIST | R:TANH:MSE | 0.996 | 0.973 | 0.995 |
| Fashion-MNIST | O:SOFTMAX:MSE | 0.937 | 0.933 | 0.839 |
| Fashion-MNIST | R:TANH:MSE | 0.952 | 0.946 | 0.935 |

Table 6: Accuracy on adversarial data generated under the black-box setting. All adversarial examples are generated using the O:SOFTMAX:CE model. Both the O:SOFTMAX:MSE and R:TANH:MSE models show higher accuracy towards black-box attacks compared to untargeted white-box attacks.

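| Dataset | Model | CWL2, c = 0.1 | CWL2, c = 1.0 | Ours, c = 0.1 | Ours, c = 1.0 |
|---|---|---|---|---|---|
| CIFAR-10 | O:SOFTMAX:CE | 1.000 | 1.000 | 1.000 | 1.000 |
| CIFAR-10 | R:TANH:MSE | 0.368 | 0.362 | 0.859 | 0.868 |
| MNIST | O:SOFTMAX:CE | 1.000 | 1.000 | 1.000 | 1.000 |
| MNIST | R:TANH:MSE | 0.006 | 0.006 | 0.715 | 0.772 |
| Fashion-MNIST | O:SOFTMAX:CE | 1.000 | 1.000 | 1.000 | 1.000 |
| Fashion-MNIST | R:TANH:MSE | 0.583 | 0.592 | 0.798 | 0.829 |

Table 7: Success rates of the targeted Carlini & Wagner L2 attack and our tailored attack on the O:SOFTMAX:CE and R:TANH:MSE models.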

5.4 Modified CWL2 attack

Table 7 compares our proposed attack with the CWL2 attack. The results show that our attack maintains its efficacy against O:SOFTMAX:CE models while significantly increasing its success rate against the R:TANH:MSE model, by up to 70.9 percentage points. We note that increasing the initial constant for our attack yields increased success rates, which is aligned with the intuition that the parameter controls the importance of the attack's success as highlighted in Section 3.3. We also observe that, despite the increase in the attack's efficacy, the R:TANH:MSE model displays more robustness compared to the O:SOFTMAX:CE model, with a decrease in success rates of up to 28.5 percentage points.

5.5 Distortion vs. performance

Figure 1 displays adversarial images generated from targeted white-box Madry et al. attacks on the O:SOFTMAX:CE and R:TANH:MSE models, respectively. We choose the lowest $\epsilon$ for which the attack achieves a success rate of 100%. It is clear that the R:TANH:MSE model requires much larger perturbations for an attack to achieve the same success rate as against the O:SOFTMAX:CE model.

Figure 2 displays adversarial images generated using the Momentum Iterative Method against both O:SOFTMAX:CE and R:TANH:MSE models where $\epsilon = 0.1$. We observe that the R:TANH:MSE model is robust even against adversarial images where the perturbations are clearly perceptible to humans.

Finally, we visualize adversarial examples generated using our modified CWL2 attack and the R:TANH:MSE model in Figure 3, where the attack achieves higher success rates compared to the original attack. The perturbations made to the images are much less perceptible compared to the adversarial examples displayed in Figures 1 and 2.

Figure 2: Top. Adversarial images generated for MNIST using the targeted MIM attack on the O:SOFTMAX:CE model. The attack achieves a success rate of 90.8%. Bottom. Adversarial images generated under the identical setting for the R:TANH:MSE model. The attack achieves a success rate of 2.3%.
Figure 3: Adversarial examples generated by our proposed attack () on the R:TANH:MSE model for test images from the CIFAR-10, MNIST, and Fashion-MNIST datasets.

5.6 Comparing upper Lipschitz bounds

Figure 4 compares the upper Lipschitz bounds of convolutional layers between the O:SOFTMAX:CE and O:SOFTMAX:MSE models. The upper bounds for the O:SOFTMAX:MSE model are consistently smaller than those of the O:SOFTMAX:CE model across each dataset up to a factor of three, supporting our hypothesis that models trained to minimize mean-squared error are more robust to small perturbations.

Figure 4: Upper Lipschitz bounds of convolutional layers of the O:SOFTMAX:CE and O:SOFTMAX:MSE models for each dataset.

6 Related work

Several defenses have also been proposed. To date, the most effective defense technique is adversarial training ([19], [37], [31], [34]), where the model is trained on a mix of clean and adversarial data. This has been shown to provide a regularization effect that makes models more robust towards attacks.

Defensive distillation has also been proposed, a mechanism whereby a model is trained on soft labels generated by another 'teacher' network in order to prevent overfitting. Other methods include introducing randomness to or applying transformations on the input data and/or the layers of the network ([11], [8], [28], [39]). However, [3] have identified that the apparent robustness of several defenses can be attributed to the introduction of computation and transformations that mask the gradients and thus break existing attacks that rely on gradients to generate adversarial examples. Their work demonstrates that small, tailored modifications to the attacks can circumvent these defenses completely.

7 Conclusion

We have reported experimental results demonstrating the adversarial robustness of models that do not follow conventional specifications. We have observed that simply changing the loss function that is minimized during training can greatly impact the robustness of a neural network against adversarial attacks. Our evaluation strategy is manifold, consisting of existing attacks, new attacks adjusted to our proposed modifications, and a spectral analysis of the model's parameters. The increase in robustness observed from experimental results suggests the importance of considering alternatives to conventional design choices when making neural networks more secure. Future work would involve further investigation into why such modifications improve the robustness of neural networks.