Deep learning models have shown impressive performance in a myriad of classification tasks (lecun2015deep). Despite their success, deep neural image classifiers are found to be easily fooled by visually imperceptible adversarial perturbations (szegedy2013intriguing). These perturbations can be crafted to reduce accuracy during test time or veer predictions towards a target class. This vulnerability not only poses a security risk in using neural networks in critical applications like autonomous driving (bojarski2016end) but also presents an interesting research problem about how these models work.
Many adversarial attacks have been proposed (carlini2017towards; papernot2018cleverhans; croce2019minimally), along with defenses to counter them (gowal2018effectiveness; zhang2019theoretically). Among them, the best defenses are based on adversarial training (AT), where models are trained on adversarial examples to better classify adversarial examples during test time (madry2017towards). While several effective defenses that employ adversarial examples have emerged (qin2019adversarial; shafahi2019adversarial), generating strong adversarial training examples adds a non-trivial computational burden to the training process (kannan2018adversarial; xie2019feature).
Adversarially trained models gain robustness and are also observed to produce more salient Jacobian matrices (Jacobians) at the input layer as a side effect (tsipras2018robustness). These Jacobians visually resemble their corresponding images for robust models but look much noisier for standard non-robust models. It is shown in theory that the saliency in Jacobian is a result of robustness (etmann2019connection). A natural question to ask is this: can an improvement in Jacobian saliency induce robustness in models? In other words, could this side effect be a new avenue to boost model robustness? To the best of our knowledge, this paper is the first to show affirmative findings for this question.
To enhance the saliency of Jacobians, we draw inspiration from neural generative networks (StarGAN2018; dai2019diagnosing). More specifically, in generative adversarial networks (GANs) (goodfellow2014generative), a generator network learns to generate natural-looking images with a training objective to fool a discriminator network. In our proposed approach, Jacobian Adversarially Regularized Networks (JARN), the classifier learns to produce salient Jacobians with a regularization objective to fool a discriminator network into classifying them as input images. This method offers a new way to look at improving robustness without relying on adversarial examples during training. With JARN, we show that directly training for salient Jacobians can advance model robustness against adversarial examples in the MNIST, SVHN and CIFAR-10 image datasets. When augmented with adversarial training, JARN can provide additive robustness to models, thus attaining competitive results. All in all, the prime contributions of this paper are as follows:
We show that directly improving the saliency of classifiers’ input Jacobian matrices can increase their adversarial robustness.
To achieve this, we propose Jacobian adversarially regularized networks (JARN) as a method to train classifiers to produce salient Jacobians that resemble input images.
Through experiments in MNIST, SVHN and CIFAR-10, we find that JARN boosts adversarial robustness in image classifiers and provides additive robustness to adversarial training.
2 Background and Related Work
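Given an input $\mathbf{x} \in \mathbb{R}^{n}$, a classifier $f(\mathbf{x}; \theta)$ maps it to output probabilities for $k$ classes in set $\mathcal{C}$, where $\theta$ is the classifier’s parameters and $\mathbf{y} \in \mathbb{R}^{k}$ is the one-hot label for the input. With a training dataset $\mathcal{D}$, the standard method to train a classifier is empirical risk minimization (ERM), through $\min_{\theta} \mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mathcal{D}} [\mathcal{L}(\mathbf{x}, \mathbf{y})]$, where $\mathcal{L}(\mathbf{x}, \mathbf{y})$ is the standard cross-entropy loss function defined as

$$\mathcal{L}(\mathbf{x}, \mathbf{y}) = -\mathbf{y}^{\top} \log f(\mathbf{x}; \theta) \tag{1}$$

While ERM trains neural networks that perform well on holdout test data, their accuracy drops drastically in the face of adversarial test examples. With an adversarial perturbation of magnitude $\varepsilon$ at input $\mathbf{x}$, a model is robust against this attack if

$$\arg\max_{i \in \mathcal{C}} f_i(\mathbf{x}; \theta) = \arg\max_{i \in \mathcal{C}} f_i(\mathbf{x} + \delta; \theta), \quad \forall \delta \in B_p(\varepsilon) = \{\delta : \|\delta\|_p \le \varepsilon\} \tag{2}$$

We focus on $p = \infty$ in this paper.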
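To improve models’ robustness, adversarial training (AT) (goodfellow2016deep) seeks to match the training data distribution with the adversarial test distribution by training classifiers on adversarial examples. Specifically, AT minimizes the loss function:

$$\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mathcal{D}} \left[ \max_{\delta \in B(\varepsilon)} \mathcal{L}(\mathbf{x} + \delta, \mathbf{y}) \right] \tag{3}$$

where the inner maximization, $\max_{\delta \in B(\varepsilon)} \mathcal{L}(\mathbf{x} + \delta, \mathbf{y})$, is taken over perturbations $\delta$ in the $\varepsilon$-ball $B(\varepsilon)$ and is usually performed with an iterative gradient-based optimization. Projected gradient descent (PGD) is one such strong method, which performs the following gradient step iteratively:

$$\delta \leftarrow \mathrm{Proj}\left[ \delta + \eta \, \mathrm{sign}\left( \nabla_{\delta} \mathcal{L}(\mathbf{x} + \delta, \mathbf{y}) \right) \right] \tag{4}$$

where $\mathrm{Proj}(\cdot)$ projects its argument back onto $B(\varepsilon)$ and $\eta$ is the step size. The computational cost of solving Equation (3) is dominated by the inner maximization problem of generating adversarial training examples. A naive way to mitigate the computational cost involved is to reduce the number of gradient steps, but that would result in weaker adversarial training examples. A consequence of this is that the models are unable to resist stronger adversarial examples that are generated with more gradient steps, due to a phenomenon called obfuscated gradients (carlini2017towards; uesato2018adversarial).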
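As an illustration of the iterative step described above, here is a minimal NumPy sketch of an $\ell_\infty$ PGD attack on a toy linear softmax classifier. The classifier, its dimensions, the step size and the perturbation budget are all hypothetical choices made for the example, not settings from this paper, and the input gradient is computed analytically rather than by backpropagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, b, x, y):
    """Cross-entropy loss of a linear softmax classifier for a one-hot label y."""
    p = softmax(W @ x + b)
    return -float(np.log(p[np.argmax(y)] + 1e-12))

def input_gradient(W, b, x, y):
    """Analytic gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    return W.T @ (p - y)

def pgd_linf(W, b, x, y, eps, step, n_iter):
    """L-infinity PGD: signed gradient ascent on the loss, projected onto the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(n_iter):
        grad = input_gradient(W, b, x + delta, y)
        delta = delta + step * np.sign(grad)   # ascent step that increases the loss
        delta = np.clip(delta, -eps, eps)      # projection onto the L-infinity ball
    return x + delta

# Hypothetical toy problem: 3 classes, 8 input features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
b = np.zeros(3)
x = rng.normal(size=8)
y = np.eye(3)[0]

x_adv = pgd_linf(W, b, x, y, eps=0.1, step=0.025, n_iter=10)
clean_loss = cross_entropy(W, b, x, y)
adv_loss = cross_entropy(W, b, x_adv, y)
print(f"clean loss {clean_loss:.4f}, adversarial loss {adv_loss:.4f}")
```

Each iteration needs one extra gradient computation, which is why the cost of adversarial training grows with the number of PGD steps. Setting `n_iter=1` with `step=eps` gives the single-step FGSM variant.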
Since the introduction of AT, a line of work has emerged that also boosts robustness with adversarial training examples. Capturing the trade-off between natural and adversarial errors, TRADES (zhang2019theoretically) encourages the decision boundary to be smooth by adding a regularization term to reduce the difference between the prediction of natural and adversarial examples. qin2019adversarial seeks to smoothen the loss landscape through local linearization by minimizing the difference between the real and linearly estimated loss value of adversarial examples. To improve adversarial training, zhang2019defense generates adversarial examples by feature scattering, i.e., maximizing feature matching distance between the examples and clean samples.
tsipras2018robustness observes that adversarially trained models display an interesting phenomenon: they produce salient Jacobian matrices ($\nabla_{\mathbf{x}} \mathcal{L}$) that loosely resemble input images, while less robust standard models have noisier Jacobians. etmann2019connection explains that linearized robustness (distance from samples to decision boundary) increases as the alignment between the Jacobian and input image grows. They show that this connection is strictly true for linear models but weakens for non-linear neural networks. While these two papers show that robustly trained models result in salient Jacobian matrices, our paper aims to investigate whether directly training to generate salient Jacobian matrices can result in robust models.
Non-Adversarial Training Regularization
Provable defenses were first proposed to bound the minimum adversarial perturbation for certain types of neural networks (hein2017formal; raghunathan2018semidefinite). One of the most advanced defenses from this class of work (wong2018scaling) uses a dual network to bound the adversarial perturbation with linear programming. The authors then optimize this bound during training to boost adversarial robustness. Apart from this category, and closer to our work, several works have studied a regularization term on top of the standard training objective to reduce the Jacobian’s Frobenius norm. This term aims to reduce the effect input perturbations have on model predictions. drucker1991double first proposed this to improve model generalization on natural test samples and called it ‘double backpropagation’. Two subsequent studies found that this also increases robustness against adversarial examples (ross2018improving; jakubovitz2018improving). Recently, hoffman2019robust proposed an efficient method to approximate the input-class probability output Jacobians of a classifier to minimize the norms of these Jacobians with a much lower computational cost. simon2019first proved that double backpropagation is equivalent to adversarial training with $\ell_2$ examples. etmann2019connection trained robust models using double backpropagation to study the link between robustness and alignment in non-linear models but did not propose a new defense in their paper. While the double backpropagation term improves robustness by reducing the effect that perturbations in individual pixels have on the classifier’s prediction through the Jacobians’ norm, it does not aim to optimize Jacobians to explicitly resemble their corresponding images semantically. Different from these prior works, we train the classifier with an adversarial loss term that aims to make the Jacobian resemble input images more closely, and show in our experiments that this approach confers more robustness.
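To make the double backpropagation penalty concrete, the following NumPy sketch adds the squared norm of the input gradient of the loss to the classification loss for a toy linear softmax classifier. It is an illustration under stated assumptions: the model, the weight `lam` and the data are hypothetical, and it uses the gradient of the loss with respect to the input rather than the input-output Jacobian studied by hoffman2019robust.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_input_grad(W, b, x, y):
    """Cross-entropy loss and its gradient w.r.t. the input, for a linear softmax model."""
    p = softmax(W @ x + b)
    loss = -float(np.log(p[np.argmax(y)]))
    return loss, W.T @ (p - y)

def double_backprop_objective(W, b, x, y, lam):
    """Classification loss plus lam times the squared L2 norm of the input gradient."""
    loss, grad_x = loss_and_input_grad(W, b, x, y)
    penalty = float(np.sum(grad_x ** 2))
    return loss + lam * penalty, loss, penalty

# Hypothetical toy problem: 3 classes, 6 input features.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 6)), np.zeros(3)
x, y = rng.normal(size=6), np.eye(3)[2]
total, ce, penalty = double_backprop_objective(W, b, x, y, lam=0.5)

# Sanity check of the analytic input gradient against central finite differences.
_, g = loss_and_input_grad(W, b, x, y)
h = 1e-5
g_fd = np.array([
    (loss_and_input_grad(W, b, x + h * e, y)[0]
     - loss_and_input_grad(W, b, x - h * e, y)[0]) / (2 * h)
    for e in np.eye(6)
])
print(f"loss {ce:.4f}, penalty {penalty:.4f}, total {total:.4f}")
```

The penalty only shrinks the magnitude of the input gradient. Nothing in it rewards a gradient whose spatial pattern resembles the image, which is the distinction the paragraph above draws with JARN.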
3 Jacobian Adversarially Regularized Networks (JARN)
Robustly trained models are observed to produce salient Jacobian matrices that resemble the input images. This raises a question in the reverse direction: will an objective function that encourages the Jacobian to more closely resemble input images make standard networks more robust? To study this, we look at neural generative networks where models are trained to produce natural-looking images. We draw inspiration from generative adversarial networks (GANs) where a generator network is trained to progressively generate more natural images that fool a discriminator model, in a min-max optimization scenario (goodfellow2014generative). More specifically, we frame a classifier as the generator model in the GAN framework so that its Jacobians can progressively fool a discriminator model to interpret them as input images.
Another motivation lies in the high computational cost of the strongest defense to date, adversarial training. Its cost on top of standard training is proportional to the number of steps taken to craft its adversarial examples, with each iteration requiring an additional backpropagation. Especially with larger datasets, there is a need for less resource-intensive defenses. In our proposed method (JARN), there is only one additional backpropagation through the classifier and the discriminator model on top of standard training. We describe JARN in the following paragraphs and offer some theoretical analysis in § 3.1.
Jacobian Adversarially Regularized Networks
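Denoting input as $\mathbf{x} \in \mathbb{R}^{hwc}$ for $h \times w$-size images with $c$ channels, one-hot label vector of $k$ classes as $\mathbf{y} \in \mathbb{R}^{k}$, we express $f_{cls}(\mathbf{x}) \in \mathbb{R}^{k}$ as the prediction of the classifier ($f_{cls}$), parameterized by $\theta$. The standard cross-entropy loss is

$$\mathcal{L}_{cls} = \mathbb{E}_{(\mathbf{x},\mathbf{y})} \left[ -\mathbf{y}^{\top} \log f_{cls}(\mathbf{x}) \right] \tag{5}$$

With gradient backpropagation to the input layer, through $f_{cls}$ with respect to $\mathcal{L}_{cls}$, we can get the Jacobian matrix $J$ as:

$$J(\mathbf{x}) := \nabla_{\mathbf{x}} \mathcal{L}_{cls} = \left[ \frac{\partial \mathcal{L}_{cls}}{\partial x_1} \; \cdots \; \frac{\partial \mathcal{L}_{cls}}{\partial x_d} \right] \tag{6}$$

where $d = hwc$. The next part of JARN entails adversarial regularization of Jacobian matrices to induce resemblance with input images. Though the Jacobians of robust models are empirically observed to be similar to images, their distributions of pixel values do not visually match (etmann2019connection). The discriminator model may easily distinguish between the Jacobian and natural images through this difference, resulting in vanishing gradients (arjovsky2017wasserstein) for the classifier to train on. To address this, an adaptor network ($f_{apt}$) is introduced to map the Jacobian into the domain of input images. In our experiments, we use a single 1x1 convolutional layer with a $\tanh$ activation function to model $f_{apt}$, expressing its model parameters as $\psi$. With $J$ as the input of $f_{apt}$, we get the adapted Jacobian matrix $J' \in \mathbb{R}^{hwc}$,

$$J' = f_{apt}(J) \tag{7}$$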
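We can frame the classifier and adaptor networks as a generator

$$G_{\theta,\psi}(\mathbf{x}, \mathbf{y}) = f_{apt}\left( \nabla_{\mathbf{x}} \mathcal{L}_{cls}(\mathbf{x}, \mathbf{y}) \right) \tag{8}$$

learning to model a distribution $p_{J'}$ that resembles $p_{\mathbf{x}}$.

We now denote a discriminator network, parameterized by $\phi$, as $f_{disc}$ that outputs a single scalar. $f_{disc}(\mathbf{x})$ represents the probability that $\mathbf{x}$ came from training images $p_{\mathbf{x}}$ rather than $p_{J'}$. To train $G_{\theta,\psi}$ to produce $J'$ that $f_{disc}$ perceives as natural images, we employ the following adversarial loss:

$$\mathcal{L}_{adv} = \mathbb{E}_{\mathbf{x}} \left[ \log f_{disc}(\mathbf{x}) \right] + \mathbb{E}_{(\mathbf{x},\mathbf{y})} \left[ \log \left( 1 - f_{disc}(G_{\theta,\psi}(\mathbf{x}, \mathbf{y})) \right) \right] \tag{9}$$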
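Combining this regularization loss with the classification loss function $\mathcal{L}_{cls}$ in Equation (5), we can optimize through stochastic gradient descent to approximate the optimal parameters for the classifier $f_{cls}$ as follows,

$$\theta^{*} = \arg\min_{\theta} \left( \mathcal{L}_{cls} + \lambda_{adv} \mathcal{L}_{adv} \right) \tag{10}$$

where $\lambda_{adv}$ controls how much the Jacobian adversarial regularization term dominates the training.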
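Since the adaptor network ($f_{apt}$) is part of the generator $G_{\theta,\psi}$, its optimal parameters $\psi^{*}$ can be found with minimization of the adversarial loss,

$$\psi^{*} = \arg\min_{\psi} \mathcal{L}_{adv} \tag{11}$$

On the other hand, the discriminator ($f_{disc}$) is optimized to maximize the adversarial loss term to distinguish Jacobians from input images correctly,

$$\phi^{*} = \arg\max_{\phi} \mathcal{L}_{adv} \tag{12}$$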
Analogous to how generators in GANs learn to generate images from noise, we add uniformly distributed noise to input image pixels during the JARN training phase. Figure 1 shows a summary of the JARN training phase while Algorithm 1 details the corresponding pseudo-code. In our experiments, we find that using the JARN framework only in the last few epochs (25%) to train the classifier confers similar adversarial robustness compared to training with JARN for the whole duration. This practice saves compute time and is used for the results reported in this paper.
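The pieces described in this section (classification loss, input Jacobian, adaptor, discriminator and adversarial loss) can be put together in a forward pass as in the NumPy sketch below. It is a simplified illustration, not the authors' implementation: the classifier is a toy linear softmax model, the adaptor is a scalar affine map followed by tanh (what a 1x1 convolution reduces to on a single channel), the discriminator is a logistic regressor, and all sizes, the noise range and the weight `lam_adv` are hypothetical. The sketch only evaluates the losses; the parameter updates require differentiating through the Jacobian, which in practice would be left to an automatic differentiation framework.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jarn_losses(X, Y, W, b, a_w, a_b, d_w, d_b, lam_adv):
    """Forward pass of the JARN objective on a batch, for a toy linear classifier."""
    P = softmax_rows(X @ W.T + b)                   # classifier probabilities
    l_cls = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    J = (P - Y) @ W                                 # per-example gradient of the loss w.r.t. the input
    J_adapted = np.tanh(a_w * J + a_b)              # adaptor: scalar affine map, then tanh
    d_real = sigmoid(X @ d_w + d_b)                 # discriminator score on input images
    d_fake = sigmoid(J_adapted @ d_w + d_b)         # discriminator score on adapted Jacobians
    l_adv = np.mean(np.log(d_real + 1e-12)) + np.mean(np.log(1.0 - d_fake + 1e-12))
    return l_cls, l_adv, l_cls + lam_adv * l_adv, J_adapted

# Hypothetical batch: 4 flattened 4x4 single-channel images with pixels in [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(4, 16))
Y = np.eye(3)[[0, 1, 2, 0]]
X_noisy = X + rng.uniform(-0.1, 0.1, size=X.shape)  # uniform input noise (range is an assumption)

W, b = 0.1 * rng.normal(size=(3, 16)), np.zeros(3)  # classifier parameters (theta)
a_w, a_b = 1.0, 0.0                                 # adaptor parameters (psi)
d_w, d_b = 0.1 * rng.normal(size=16), 0.0           # discriminator parameters (phi)

l_cls, l_adv, total, J_adapted = jarn_losses(X_noisy, Y, W, b, a_w, a_b, d_w, d_b, lam_adv=1.0)
print(f"L_cls {l_cls:.4f}, L_adv {l_adv:.4f}, total {total:.4f}")
```

In a training loop, the classifier would take a gradient step to decrease `total`, the adaptor a step to decrease `l_adv`, and the discriminator a step to increase `l_adv`, with the discriminator updated only at a chosen interval of iterations.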
3.1 Theoretical Analysis
Here, we study the link between JARN’s adversarial regularization term and the notion of linearized robustness. Assuming a non-parametric setting where the models have infinite capacity, we have the following theorem while optimizing $G_{\theta,\psi}$ with the adversarial loss $\mathcal{L}_{adv}$.
Theorem 3.1. The global minimum of $\mathcal{L}_{adv}$ is achieved when $G_{\theta,\psi}$ maps $\mathbf{x}$ to itself, i.e., $G_{\theta,\psi}(\mathbf{x}, \mathbf{y}) = \mathbf{x}$.
Its proof is deferred to § A. If we assume the Jacobian $J$ of our classifier $f_{cls}$ to be the direct output of $G_{\theta,\psi}$, then $J = \mathbf{x}$ at the global minimum of the adversarial objective.
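In etmann2019connection, it is shown that the linearized robustness of a model is loosely upper-bounded by the alignment between the Jacobian and the input image. More concretely, denoting $\Psi^{i}(\mathbf{x})$ as the logits value of class $i$ in a classifier $F$, its linearized robustness can be expressed as $\rho(\mathbf{x}) := \min_{j \neq i^{*}} \frac{\Psi^{i^{*}}(\mathbf{x}) - \Psi^{j}(\mathbf{x})}{\| \nabla_{\mathbf{x}} \Psi^{i^{*}}(\mathbf{x}) - \nabla_{\mathbf{x}} \Psi^{j}(\mathbf{x}) \|}$. Here we quote the theorem from etmann2019connection:
Theorem 3.2 (Linearized Robustness Bound).
(etmann2019connection) Defining $i^{*}$ and $j^{*}$ as the top two predictions, we let the Jacobian with respect to the difference in the top two logits be $g := \nabla_{\mathbf{x}} (\Psi^{i^{*}} - \Psi^{j^{*}})(\mathbf{x})$. Expressing the alignment between the Jacobian and the input as $\alpha(\mathbf{x}) = \frac{| \langle \mathbf{x}, g \rangle |}{\| g \|}$, then

$$\rho(\mathbf{x}) \le \alpha(\mathbf{x}) + \frac{C}{\| g \|} \tag{13}$$

where $C$ is a positive constant.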
Combining with what we have in Theorem 3.1, assuming $J$ to be close to $g$ up to a fixed constant term, the alignment term $\alpha(\mathbf{x})$ in Equation (13) is maximum when $\mathcal{L}_{adv}$ reaches its global minimum. Though this is not a strict upper bound and, to facilitate the training in JARN in practice, we use an adaptor network to transform the Jacobian, i.e., $J' = f_{apt}(J)$, our experiments show that model robustness can be improved with this adversarial regularization.
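For a linear classifier, the bound in Theorem 3.2 can be checked directly, which may help build intuition for the alignment term. With logits $\Psi(\mathbf{x}) = W\mathbf{x} + b$, the logit difference of the top two classes is $\langle g, \mathbf{x} \rangle + (b_{i^*} - b_{j^*})$ with $g = w_{i^*} - w_{j^*}$, so the bound holds with $C = |b_{i^*} - b_{j^*}|$. This choice of $C$ is our own derivation for the linear case, and the model and data in the NumPy sketch below are hypothetical.

```python
import numpy as np

def linearized_robustness_and_alignment(W, b, x):
    """Return (rho, alpha, C, g) for a linear classifier with logits W @ x + b."""
    logits = W @ x + b
    order = np.argsort(logits)[::-1]
    i_star, j_star = order[0], order[1]          # top two predictions
    # Linearized robustness: smallest normalized logit margin over competing classes.
    rho = min(
        (logits[i_star] - logits[j]) / np.linalg.norm(W[i_star] - W[j])
        for j in range(len(logits)) if j != i_star
    )
    g = W[i_star] - W[j_star]                    # Jacobian of the top-two logit difference
    alpha = abs(x @ g) / np.linalg.norm(g)       # alignment between the input and the Jacobian
    C = abs(b[i_star] - b[j_star])               # constant of the bound for this linear model
    return rho, alpha, C, g

# Hypothetical toy problem: 4 classes, 10 input features.
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 10))
b = 0.1 * rng.normal(size=4)
x = rng.normal(size=10)

rho, alpha, C, g = linearized_robustness_and_alignment(W, b, x)
bound = alpha + C / np.linalg.norm(g)

# Without biases the constant vanishes and robustness is bounded by alignment alone.
rho_nb, alpha_nb, C_nb, _ = linearized_robustness_and_alignment(W, np.zeros(4), x)
print(f"rho {rho:.4f} <= bound {bound:.4f}; no-bias rho {rho_nb:.4f} <= alpha {alpha_nb:.4f}")
```

In the bias-free case the bound reduces to $\rho(\mathbf{x}) \le \alpha(\mathbf{x})$, which is the sense in which higher alignment between the Jacobian and the input leaves room for higher linearized robustness.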
We conduct experiments on three image datasets, MNIST, SVHN and CIFAR-10 to evaluate the adversarial robustness of models trained by JARN.
MNIST consists of 60k training and 10k test binary-colored images. We train a CNN, sequentially composed of 3 convolutional layers and 1 final softmax layer. All 3 convolutional layers have a stride of 5 while each layer has an increasing number of output channels (64-128-256). For JARN, we use, a discriminator network of 2 CNN layers (64-128 output channels) and update it every 10 training iterations. We evaluate trained models against adversarial examples with perturbation , crafted from FGSM and PGD (5 & 40 iterations). FGSM generates adversarial examples with only one gradient step and is weaker than the iterative PGD method.
The CNN trained with JARN shows improved adversarial robustness over a standard model across the three types of adversarial examples (Table 1). In the MNIST experiments, we find that data augmentation with uniform noise to pixels alone provides no benefit in robustness over the baseline.
SVHN is a 10-class house number image classification dataset with 73257 training and 26032 test images, each of size $32 \times 32$ pixels. We train the Wide-Resnet 32-10 model following hyperparameters from madry2017towards’s setup for their CIFAR-10 experiments. For JARN, we use , a discriminator network of 5 CNN layers (16-32-64-128-256 output channels) and update it every 20 training iterations. We evaluate trained models against adversarial examples with (), crafted from FGSM and 5, 10, 20-iteration PGD attacks.
Similar to the findings in § 4.1, JARN advances the adversarial robustness of the classifier from the standard baseline against all four types of attacks. Interestingly, uniform noise image augmentation increases adversarial robustness from the baseline in the SVHN experiments, concurring with previous work that shows noise augmentation improves robustness (ford2019adversarial).
CIFAR-10 contains colored images labeled as 10 classes, with 50k training and 10k test images. We train the Wide-Resnet 32-10 model using similar hyperparameters to (madry2017towards) for our experiments. Following the settings from madry2017towards, we compare with a strong adversarial training baseline (PGD-AT7) that involves training the model with adversarial examples generated with a 7-iteration PGD attack. For JARN, we use , a discriminator network of 5 CNN layers (32-64-128-256-512 output channels) and update it every 20 training iterations. We evaluate trained models against adversarial examples with (), crafted from FGSM and PGD (5, 10 & 20 iterations). We also add a fast gradient sign attack baseline (FGSM-AT1) that generates adversarial training examples with only 1 gradient step. Though FGSM-trained models are known to rely on obfuscated gradients to counter weak attacks, we augment it with JARN to study if there is an additive robustness benefit against strong attacks. We also implemented double backpropagation (drucker1991double; ross2018improving) for comparison.
Similar to results from the previous two datasets, the JARN classifier performs better than the standard baseline for all four types of adversarial examples. Compared to the model trained with uniform-noise augmentation, JARN performs comparably under the weaker FGSM attack while being more robust against the two stronger PGD attacks. JARN also outperforms the double backpropagation baseline, showing that regularizing for salient Jacobians confers more robustness than regularizing for smaller Jacobian Frobenius norm values. The strong PGD-AT7 baseline shows higher robustness against PGD attacks than the JARN model. When we train JARN together with 1-step adversarial training (JARN-AT1), we find that the model’s robustness exceeds that of the strong PGD-AT7 baseline on all four adversarial attacks, suggesting JARN’s gain in robustness is additive to that of AT.
4.3.1 Generalization of Robustness
Adversarial training (AT) based defenses generally train the model on examples generated by perturbation of a fixed $\varepsilon$. Unlike AT, JARN by itself does not have $\varepsilon$ as a training parameter. To study how JARN-AT1 robustness generalizes, we conduct PGD attacks of varying $\varepsilon$ and strength (5, 10 and 20 iterations). We also include another PGD-AT7 baseline that was trained at a higher $\varepsilon$. JARN-AT1 shows higher robustness than the two PGD-AT7 baselines against attacks with higher $\varepsilon$ values () across the three PGD attacks, as shown in Figure 2. We also observe that the PGD-AT7 variants outperform each other on attacks with $\varepsilon$ values close to their training $\varepsilon$, suggesting that their robustness is more adapted to resist adversarial examples that they are trained on. This relates to findings by tramer2019adversarial, which show that robustness from adversarial training is highest against the perturbation type that models are trained on.
4.3.2 Loss Landscape
We compute the classification loss value along the adversarial perturbation’s direction and a random orthogonal direction to analyze the loss landscape of the models. From Figure 3, we see that the models trained by the standard and FGSM-AT methods display loss surfaces that are jagged and non-linear. This explains why the FGSM-AT model displays modest accuracy under the weaker FGSM attack but fails under attacks with more iterations, a phenomenon called obfuscated gradients (carlini2017towards; uesato2018adversarial) where the initial gradient steps are still trapped within the locality of the input but eventually escape with more iterations. The JARN model displays a loss landscape that is less steep compared to the standard and FGSM-AT models, marked by the much lower (1 order of magnitude) loss value in Figure 2(c). When JARN is combined with one iteration of adversarial training, the JARN-AT1 model is observed to have a much smoother loss landscape, similar to that of the PGD-AT7 model, a strong baseline previously observed to be free of obfuscated gradients. This suggests that JARN and AT have additive benefits and that JARN-AT1’s adversarial robustness is not attributable to obfuscated gradients.
A possible explanation behind the improved robustness through increasing Jacobian saliency is that the space of Jacobian shrinks under this regularization, i.e., Jacobians have to resemble non-noisy images. Intuitively, this means that there would be fewer paths for an adversarial example to reach an optimum in the loss landscape, improving the model’s robustness.
4.3.3 Saliency of Jacobian
The Jacobian matrices of the JARN and PGD-AT models are salient and visually resemble the images more than those from the standard model (Figure 4). Upon closer inspection, the Jacobian matrices of the PGD-AT model concentrate their values at small regions around the object of interest whereas those of the JARN model cover a larger proportion of the images. One explanation is that the JARN model is trained to fool the discriminator network and hence generates Jacobians that contain details of input images to more closely resemble them.
4.3.4 Compute Time
Training with JARN is computationally more efficient than adversarial training (Table 4). Even when JARN is combined with FGSM adversarial training, it takes less than half the time of 7-step PGD adversarial training while outperforming it in robustness.
4.3.5 Sensitivity to Hyperparameters
The performance of GANs in image generation is well known to be sensitive to training hyperparameters. We test JARN performance across a range of $\lambda_{adv}$, batch size and discriminator update intervals that are different from § 4.3 and find that its performance is relatively stable across hyperparameter changes, as shown in Appendix Figure 5. In a typical GAN framework, each training step involves a real image sample and an image generated from noise that is decoupled from the real sample. In contrast, a Jacobian is conditioned on its original input image and both are used in the same training step of JARN. This training step resembles that of VAE-GAN (larsen2015autoencoding), where pairs of real images and their reconstructed versions are used for training together, resulting in generally more stable gradients and convergence than GAN. We believe that this similarity favors JARN’s stability over a wider range of hyperparameters.
4.3.6 Black-box Transfer Attacks
Transfer attacks are adversarial examples generated from an alternative, substitute model and evaluated on the defense to test for gradient masking (papernot2016transferability; carlini2019evaluating). More specifically, defenses relying on gradient masking will display lower robustness towards transfer attacks than white-box attacks. When evaluated on such black-box attacks using adversarial examples generated from a PGD-AT7 trained model and their differently initialized versions, both JARN and JARN-AT1 display higher accuracy than when under white-box attacks (Table 5). This demonstrates that JARN’s robustness does not rely on gradient masking. Rather unexpectedly, JARN performs better than JARN-AT1 under the PGD-AT7 transfer attacks, which we believe is attributed to its better performance on clean test samples.
In this paper, we show that training classifiers to give more salient input Jacobian matrices that resemble images can advance their robustness against adversarial examples. We achieve this through an adversarial regularization framework (JARN) that trains the model’s Jacobians to fool a discriminator network into classifying them as images. Through our experiments on three image datasets, JARN boosts the adversarial robustness of standard models and gives competitive performance when added on to weak defenses like FGSM. Our findings open up the improvement of Jacobian saliency as a new avenue to boost adversarial robustness.
Appendix A Proof of Theorem 3.1
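Theorem 3.1. The global minimum of $\mathcal{L}_{adv}$ is achieved when $G_{\theta,\psi}$ maps $\mathbf{x}$ to itself, i.e., $G_{\theta,\psi}(\mathbf{x}, \mathbf{y}) = \mathbf{x}$.
From (goodfellow2014generative), for a fixed $G_{\theta,\psi}$ with output distribution $p_{J'}$, the optimal discriminator is

$$f^{*}_{disc}(\mathbf{x}) = \frac{p_{\mathbf{x}}(\mathbf{x})}{p_{\mathbf{x}}(\mathbf{x}) + p_{J'}(\mathbf{x})}$$

We can include the optimal discriminator into Equation (9) to get

$$\begin{aligned}
\mathcal{L}_{adv} &= \mathbb{E}_{\mathbf{x} \sim p_{\mathbf{x}}} \left[ \log \frac{p_{\mathbf{x}}(\mathbf{x})}{p_{\mathbf{x}}(\mathbf{x}) + p_{J'}(\mathbf{x})} \right] + \mathbb{E}_{\mathbf{x} \sim p_{J'}} \left[ \log \frac{p_{J'}(\mathbf{x})}{p_{\mathbf{x}}(\mathbf{x}) + p_{J'}(\mathbf{x})} \right] \\
&= -\log 4 + KL\left( p_{\mathbf{x}} \,\Big\|\, \frac{p_{\mathbf{x}} + p_{J'}}{2} \right) + KL\left( p_{J'} \,\Big\|\, \frac{p_{\mathbf{x}} + p_{J'}}{2} \right) \\
&= -\log 4 + 2 \cdot JS\left( p_{\mathbf{x}} \,\|\, p_{J'} \right)
\end{aligned}$$

where $KL$ and $JS$ are the Kullback-Leibler and Jensen-Shannon divergence respectively. Since the Jensen-Shannon divergence is always non-negative, $\mathcal{L}_{adv}$ reaches its global minimum value of $-\log 4$ when $JS(p_{\mathbf{x}} \,\|\, p_{J'}) = 0$. When $JS(p_{\mathbf{x}} \,\|\, p_{J'}) = 0$, we get $p_{\mathbf{x}} = p_{J'}$ and consequently $G_{\theta,\psi}(\mathbf{x}, \mathbf{y}) = \mathbf{x}$, thus completing the proof.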
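The identity behind this proof, that the adversarial loss evaluated at the optimal discriminator equals $-\log 4$ plus twice the Jensen-Shannon divergence between the image and generator distributions, can be checked numerically on small discrete distributions. The two distributions in the Python sketch below are arbitrary, hypothetical values chosen only for illustration.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adv_loss_at_optimal_disc(p_x, p_g):
    """Adversarial loss with the optimal discriminator D*(v) = p_x(v) / (p_x(v) + p_g(v))."""
    total = 0.0
    for px, pg in zip(p_x, p_g):
        if px > 0:
            total += px * math.log(px / (px + pg))   # E over images of log D*
        if pg > 0:
            total += pg * math.log(pg / (px + pg))   # E over generator outputs of log(1 - D*)
    return total

p_x = [0.5, 0.3, 0.2]   # hypothetical distribution of input images
p_g = [0.2, 0.2, 0.6]   # hypothetical distribution of generator outputs

lhs = adv_loss_at_optimal_disc(p_x, p_g)
rhs = -math.log(4) + 2 * js(p_x, p_g)
at_minimum = adv_loss_at_optimal_disc(p_x, p_x)
print(f"L_adv {lhs:.6f}, -log4 + 2*JS {rhs:.6f}, value when distributions match {at_minimum:.6f}")
```

When the two distributions coincide, the loss equals $-\log 4$, and any mismatch raises it by twice the Jensen-Shannon divergence.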